glm-4v

glm-4v

The open-source multimodal model launched by Zhipu AI, based on the MOE architecture.
2024-06-05
LLM
Model capability: image
Input:
$7/1M tokens
Output:
$7/1M tokens
Bulk order? Contact your manager for exclusive deals

API Overview

GLM-4V is a flagship multimodal model launched by Zhipu AI. Its core positioning is as a vision-understanding engine based on the MOE architecture, featuring 1120×1120 high-resolution input and a deep-thinking mode to achieve precise analysis of images, videos, and documents, as well as cross-modal task processing.

  • Multimodal Fusion: Supports three types of inputs—images, videos, and files—and delivers accurate text-analysis results, covering enterprise-level scenarios such as front-end replication and security quality inspection.
  • High-Resolution Processing: Exclusively supports 1120×1120 resolution input; uses downsampling technology to reduce token overhead and enhance parsing accuracy.
  • Performance on Par with GPT-4V: Outperforms comparable open-source models in multiple evaluations, achieving SOTA-level overall performance, especially excelling in complex visual reasoning tasks.
  • Structured Output: Natively supports JSON format, enabling direct integration with business systems and reducing secondary development efforts.
  • Enterprise-Level Customization: Supports LoRA fine-tuning, increasing model availability from 60% to 89%.
  • Open-Source Ecosystem: The open-source version has been downloaded over 13 million times, ranking first among domestically developed models, and supports developers in deploying lightweight applications locally.

───────────────────────────────────────────────────────────────────

Core Capabilities

🧠 MOE Architecture Powered: With a total parameter count of 10.6B and 1.2B activated parameters, it achieves the highest performance among comparable open-source models, boosting visual reasoning efficiency by 40%.

🔍 Full-Modal Analysis: Exclusively supports three modalities—video, image, and file—as input, enabling multi-source information fusion analysis in a single call.

⚡ Deep-Thinking Mode: Dynamically activates complex reasoning chains to tackle advanced tasks such as subject-specific problem-solving and logical deduction, with an accuracy rate exceeding 92%.

🌐 High-Resolution Processing: With 1120×1120 input resolution and intelligent downsampling, detail-capturing capability improves by 50%, minimizing information loss.

🛠️ Automated Execution: A GUI Agent precisely identifies interface elements and automatically performs office operations such as PPT editing and data entry.

📊 Structured Output: Natively supports JSON format and coordinate-based localization (such as Grounding), directly generating interactive code or structured data.

Playground

Log in to explore more features! Click to Log In

API Analytics

API Reference (1)

API DescriptionAPI EndpointRequest MethodStabilityParameter Description
Chat (Zhipu GLM-4V)
POST
Stable
View Details

API Pricing

$
ModelDescriptionContextOfficial Price302.AI Price

glm-4v

-
32000

Input$7 / 1M tokens
Output$7 / 1M tokens

Input$7/ 1M tokens
Output$7/ 1M tokens
Original Price