glm-5v-turbo

glm-5v-turbo

GLM-5V-Turbo is Zhipu’s first multimodal coding foundation model, designed specifically for visual programming tasks. It can natively process multimodal inputs such as images, videos, and text, and excels at long-term planning, complex programming, and action execution. It is deeply integrated with agent workflows, enabling seamless collaboration with agents like Claude Code and OpenClaw to complete the full closed-loop process of “understanding the environment → planning actions → executing tasks.”
2026-04-02
LLM
Model capability: audioModel capability: imageModel capability: videoModel capability: thinkingModel capability: function_call
Input:
$0.72/1M tokensstarting from
Output:
$3.2/1M tokensstarting from
Bulk order? Contact your manager for exclusive deals
稳定性
Stable

API Overview

GLM-5V-Turbo is an ultra-fast optimized version of GLM-5V, Zhipu AI’s flagship multimodal model. It integrates the powerful logical reasoning capabilities of the GLM-5 series with cutting-edge visual perception technologies, aiming to provide developers with ultra-low-latency, high-precision multimodal interaction experiences. Whether it’s complex chart analysis, real-time screen understanding, or semantic summarization of video content, GLM-5V-Turbo can perform deep processing of visual information at astonishing speeds, making it the core engine for building the next-generation “vision-driven” intelligent agents (Vision-Agents). ───────────────────────────────────────────────────────────────────

Core Capabilities


Ultra-fast Visual Perception Speed: Deeply optimized for the visual processing pipeline, this significantly reduces the response time from image input to semantic output, making it ideal for business scenarios that require real-time visual feedback. Multimodal Deep Logical Reasoning: Going beyond mere image recognition, it boasts strong spatial awareness and logical association capabilities, enabling precise interpretation of complex charts, technical documents, or multi-screen information and transforming them into structured task instructions. High-performance Visual Agent Support: Perfectly compatible with intelligent agent orchestration frameworks such as OpenClaw, it can perform actions like clicking and dragging based on visual observations, making it a key tool for achieving “visual interaction automation.” Precise High-fidelity Description: It demonstrates extremely high accuracy and robustness in fine-grained visual recognition, complex scene description, and tasks involving long texts and image associations, reducing agent execution deviations caused by misinterpretations.

Playground

Log in to explore more features! Click to Log In

API Analytics

API Reference (1)

API DescriptionAPI EndpointRequest MethodStabilityParameter Description
Chat (Zhipu GLM Multimodal)
POST
Stable
View Details

API Pricing

$
ModelDescriptionContextOfficial Price302.AI Price

glm-5v-turbo

Input length [0, 32k]
200000

Input$0.72 / 1M tokens
Output$3.2 / 1M tokens

Input$0.72/ 1M tokens
Output$3.2/ 1M tokens
Original Price

glm-5v-turbo

Input length [32k, 200k]
200000

Input$1.1 / 1M tokens
Output$3.8 / 1M tokens

Input$1.1/ 1M tokens
Output$3.8/ 1M tokens
Original Price