
qwen/qwen2.5-vl-72b-instruct
API Overview
Qwen2.5-VL-72B-Instruct is Alibaba Cloud’s top-tier multimodal model, positioned as an “enterprise-grade visual agent and all-scenario expert in text, image, and video collaboration.” With a massive parameter scale of 72 billion, it delivers state-of-the-art visual understanding, long-video processing, and intelligent device-control capabilities, making it well-suited for high-end enterprise tasks and complex intelligent interaction scenarios.
- Leading Visual Understanding Performance: Its performance on image-related tasks rivals that of international flagship models—achieving 74.8 points on MathVista, 96.4 points on DocVQA, 61.5/63.7 points on OCRBench-V2, and 79.8 points on CC-OCR. It accurately parses complex images such as multilingual documents and charts.
- Deep Long-Video Processing: It supports the understanding of long videos exceeding one hour in length, precisely capturing events and pinpointing specific segments. With scores of 73.3/79.1 on VideoMME and 47.3 on LVBench, it is ideal for professional video-analysis applications.
- Superior Visual-Agent Capabilities: As a high-end visual intelligence agent, it enables automated control of computers and mobile phones. ScreenSpot achieves a score of 87.1, and Android Control demonstrates 93.7% accuracy in low-complexity scenarios, allowing dynamic invocation of tools to handle intricate interactions.
- Peak Multimodal Collaboration: It deeply integrates text and visual capabilities, supporting structured output (such as invoices and table data), making it suitable for specialized scenarios like finance and business. Its Chinese-language interaction has been particularly optimized.
───────────────────────────────────────────────────────────────────
Core Capabilities
🖼️ Ultimate Image Parsing: Handles complex multilingual documents, charts, and professional images, extracting highly accurate structured data. Its performance on mathematical vision tasks is top-notch, perfectly meeting enterprise-level document-digitization needs.
🎬 Precise Capture of Long-Video Events: Understands the temporal logic of long videos, locates key segments, and generates detailed descriptions and Q&A, making it ideal for video surveillance analysis and professional content summarization scenarios.
🤖 High-End Device-Control Agent: Based on visual environments and instructions, it automates operations on computers and mobile phones, enabling complex interface interactions and boosting the efficiency of enterprise smart-device management.
📊 Professional Structured Data Generation: Converts images such as invoices and reports into standardized data formats, seamlessly integrating with enterprise data systems and catering to specialized scenarios like financial accounting and business data analysis.
🌍 Full-Scenario High-End Adaptation: Compatible with multiple input formats—including local files, URLs, and base64—this model balances complex enterprise tasks with high-end personal intelligent interactions, satisfying diverse and sophisticated user demands.
Playground
Log in to explore more features! Click to Log In