zai-org/glm-4.5v

zai-org/glm-4.5v

Multimodal Visual Reasoning Model Based on the MOE Architecture
2025-08-25
LLM
Model capability: imageModel capability: thinkingModel capability: function_call
Input:
$0.643/1M tokens
Output:
$1.86/1M tokens
Bulk order? Contact your manager for exclusive deals

API Overview

GLM-4.5V is a flagship multimodal visual reasoning model launched by Zhipu AI, with a core focus on setting a new open-source benchmark for “all-scenario visual understanding + deep reasoning”.

  • Ultra-large-scale MoE architecture: Total parameters: 10.6B; active parameters: only 1.2B; achieves state-of-the-art (SOTA) performance among open-source models at the same level in 41 public multimodal benchmarks.
  • Full-modal input support: Natively processes images, videos, PDFs/Word documents, and GUI screen captures, enabling unified visual reasoning.
  • Outstanding real-world scenario capabilities: Accurately locates object coordinates, replicates web frontend code, infers the latitude and longitude of street-view shooting locations, and even ranks among the top 66 globally in human competitions.
  • Flexible switching between thinking modes: Supports both “fast response” and “deep reasoning” modes, striking a balance between efficiency and output quality.

───────────────────────────────────────────────────────────────────

Core Capabilities

👁️ Precise Visual Grounding: Outputs precise bounding boxes based on natural language descriptions (e.g., “the second beer bottle from the right”), suitable for industrial applications such as quality inspection and remote sensing.

🖥️ GUI Intelligent Agent Operation: Understands e-commerce pages and office software interfaces, automatically recognizes prices and icons, and executes actions such as clicks and edits.

📄 Deep Interpretation of Complex Documents: Reads and understands images and text like a human, simultaneously comprehending charts, tables, and text without losing information through OCR.

🎬 Long-Video Semantic Reasoning: Analyzes dynamic changes across multiple video frames, reconstructs interaction logic, and even replicates runnable HTML pages from screen recordings.

🧠 World Knowledge Fusion Reasoning: Combines visual clues such as vegetation, buildings, and climate to infer image shooting locations and background information without requiring external searches.

Playground

Log in to explore more features! Click to Log In

API Analytics

API Reference (1)

API DescriptionAPI EndpointRequest MethodStabilityParameter Description
Chat(PPIO)
POST
Stable
View Details

API Pricing

$
ModelDescriptionContextOfficial Price302.AI Price

zai-org/glm-4.5v

-
65536

Input$0.643 / 1M tokens
Output$1.86 / 1M tokens

Input$0.643/ 1M tokens
Output$1.86/ 1M tokens
Original Price