
sophnet/Qwen2.5-VL-72B-Instruct
API Overview
Qwen2.5-VL-72B-Instruct is Alibaba Tongyi’s flagship multimodal model, primarily positioned as an “enterprise-grade visual agent and all-scenario expert in text-image-video collaboration.” With a massive parameter scale of 72 billion, it delivers state-of-the-art visual understanding, long-video processing, and intelligent device control capabilities, making it well-suited for high-end enterprise tasks and complex intelligent interaction scenarios.
- Leading Visual Understanding Performance: Its performance on image-related tasks rivals that of international flagship models—achieving 74.8 points on MathVista, 96.4 points on DocVQA, 61.5/63.7 points on OCRBench-V2, and 79.8 points on CC-OCR. It accurately parses complex images such as multilingual documents and charts.
- Deep Long-Video Processing: It supports the understanding of long videos lasting over one hour, precisely capturing events and pinpointing specific segments. With scores of 73.3/79.1 on VideoMME and 47.3 on LVBench, it is ideally suited for professional video analysis applications.
- Superior Visual Agent Capabilities: As a high-end visual intelligence agent, it enables automated control of computers and mobile phones. It achieves a score of 87.1 on ScreenSpot and boasts a 93.7% accuracy rate in low-complexity Android Control scenarios, allowing dynamic invocation of tools to handle intricate interactions.
- Peak Multimodal Collaboration: It deeply integrates text and visual capabilities, supporting structured output (such as invoices and table data), making it ideal for specialized scenarios like finance and business. It excels in optimizing Chinese-language interactions.
───────────────────────────────────────────────────────────────────
Core Capabilities
🖼️ Ultimate Image Parsing: It processes complex multilingual documents, charts, and professional images, extracting highly accurate structured data. Its performance on mathematical vision tasks is top-tier, perfectly meeting enterprise-level document digitization needs.
🎬 Precise Capture of Long-Video Events: It understands the temporal logic of long videos, locates key segments, and generates detailed descriptions and question-and-answer responses, making it ideal for video surveillance analysis and professional content summarization scenarios.
🤖 High-End Device Control Agent: Based on visual environments and instructions, it automates operations on computers and mobile phones, enabling complex interface interactions and boosting the efficiency of enterprise smart-device management.
📊 Professional Structured Data Generation: It converts images such as invoices and reports into standardized data formats, seamlessly integrating with enterprise data systems and catering to specialized scenarios like financial accounting and business data analysis.
🌍 Full-Scenario High-End Adaptation: It supports multiple input formats, including local files, URLs, and base64-encoded data, balancing complex enterprise tasks with high-end personal intelligent interactions to meet diverse and sophisticated demands.
Playground
Log in to explore more features! Click to Log In