qwen/qwen2.5-vl-72b-instruct

qwen/qwen2.5-vl-72b-instruct

Top-tier multimodal instruction-finetuned model
2025-02-21
LLM
Input:
$0.6/1M tokens
Output:
$0.6/1M tokens
Bulk order? Contact your manager for exclusive deals

API Overview

Qwen2.5-VL-72B-Instruct is Alibaba Cloud’s top-tier multimodal model, positioned as an “enterprise-grade visual agent and all-scenario expert in text, image, and video collaboration.” With a massive parameter scale of 72 billion, it delivers state-of-the-art visual understanding, long-video processing, and intelligent device-control capabilities, making it well-suited for high-end enterprise tasks and complex intelligent interaction scenarios.

  • Leading Visual Understanding Performance: Its performance on image-related tasks rivals that of international flagship models—achieving 74.8 points on MathVista, 96.4 points on DocVQA, 61.5/63.7 points on OCRBench-V2, and 79.8 points on CC-OCR. It accurately parses complex images such as multilingual documents and charts.
  • Deep Long-Video Processing: It supports the understanding of long videos exceeding one hour in length, precisely capturing events and pinpointing specific segments. With scores of 73.3/79.1 on VideoMME and 47.3 on LVBench, it is ideal for professional video-analysis applications.
  • Superior Visual-Agent Capabilities: As a high-end visual intelligence agent, it enables automated control of computers and mobile phones. ScreenSpot achieves a score of 87.1, and Android Control demonstrates 93.7% accuracy in low-complexity scenarios, allowing dynamic invocation of tools to handle intricate interactions.
  • Peak Multimodal Collaboration: It deeply integrates text and visual capabilities, supporting structured output (such as invoices and table data), making it suitable for specialized scenarios like finance and business. Its Chinese-language interaction has been particularly optimized.

───────────────────────────────────────────────────────────────────

Core Capabilities

🖼️ Ultimate Image Parsing: Handles complex multilingual documents, charts, and professional images, extracting highly accurate structured data. Its performance on mathematical vision tasks is top-notch, perfectly meeting enterprise-level document-digitization needs.

🎬 Precise Capture of Long-Video Events: Understands the temporal logic of long videos, locates key segments, and generates detailed descriptions and Q&A, making it ideal for video surveillance analysis and professional content summarization scenarios.

🤖 High-End Device-Control Agent: Based on visual environments and instructions, it automates operations on computers and mobile phones, enabling complex interface interactions and boosting the efficiency of enterprise smart-device management.

📊 Professional Structured Data Generation: Converts images such as invoices and reports into standardized data formats, seamlessly integrating with enterprise data systems and catering to specialized scenarios like financial accounting and business data analysis.

🌍 Full-Scenario High-End Adaptation: Compatible with multiple input formats—including local files, URLs, and base64—this model balances complex enterprise tasks with high-end personal intelligent interactions, satisfying diverse and sophisticated user demands.

Playground

Log in to explore more features! Click to Log In

API Analytics

API Reference (1)

API DescriptionAPI EndpointRequest MethodStabilityParameter Description
Chat(PPIO)
POST
Stable
View Details

API Pricing

$
ModelDescriptionContextOfficial Price302.AI Price

qwen/qwen2.5-vl-72b-instruct

-
32000

Input$0.6 / 1M tokens
Output$0.6 / 1M tokens

Input$0.6/ 1M tokens
Output$0.6/ 1M tokens
Original Price