
THUDM/GLM-4.1V-9B-Thinking
API Overview
GLM-4.1V-9B-Thinking is an inference-first multimodal large model released by Zhipu AI (THUDM). Built on the GLM-4-9B architecture, it has approximately 9 billion parameters and is primarily positioned as a high-performance vision-language model that supports chain-of-thought reasoning. It achieves state-of-the-art performance among 10-billion-parameter VLMs, even surpassing its 72-billion-parameter competitors.
- Breakthrough in Reasoning Capabilities: For the first time in the GLM-V series, we introduce the “Thinking Paradigm,” enhancing complex task reasoning through reinforcement learning. In 28 benchmarks, it outperforms 10-billion-parameter models in 23 of them.
- Performance Surpassing Previous Models: On 18 tasks, it surpasses Qwen2.5-VL-72B, demonstrating an efficient design that delivers “great intelligence from a small model.”
- Ultra-Long Context Support: It supports context lengths up to 64K, making it well-suited for complex scenarios such as long documents and multi-turn image-and-text dialogues.
- High-Resolution Compatibility: It accepts images with arbitrary aspect ratios, supporting resolutions up to 4K for more precise detail capture.
- Bilingual Open Source and Open: It offers bilingual Chinese-English understanding and generation capabilities and is open-sourced under the Apache License (the base version, GLM-4.1V-9B-Base, is also available).
───────────────────────────────────────────────────────────────────
Core Capabilities
🧠 Chain-of-Thought Reasoning: Our exclusive “Thinking Mode” significantly improves the accuracy, logic, and interpretability of answers.
📊 SOTA Multimodal Performance: Among 10-billion-parameter models, it ranks first in 23 benchmarks and outperforms 72-billion-parameter models in 18 benchmarks.
🖼️ 4K High-Resolution Understanding: It supports images of any aspect ratio, with maximum input resolution up to 4K, enabling precise interpretation of charts, documents, and scenes.
📚 64K Long Context: It easily handles mixed inputs of multiple images and long texts, making it ideal for applications in education, research, customer service, and more.
🇨🇳 Natively Supports Chinese and English: Optimized specifically for Chinese-language scenarios while maintaining strong English comprehension capabilities.
🔓 Open Source and Commercially Usable: Both the base model and the inference model are open-source, supporting both research and industrial-scale deployment.
Playground
Log in to explore more features! Click to Log In