sophnet/Qwen2.5-VL-72B-Instruct

sophnet/Qwen2.5-VL-72B-Instruct

Top-tier multimodal instruction-finetuned model
2025-07-08
LLM
Model capability: image
Input:
$2.29/1M tokens
Output:
$6.86/1M tokens
Bulk order? Contact your manager for exclusive deals

API Overview

Qwen2.5-VL-72B-Instruct is Alibaba Tongyi’s flagship multimodal model, primarily positioned as an “enterprise-grade visual agent and all-scenario expert in text-image-video collaboration.” With a massive parameter scale of 72 billion, it delivers state-of-the-art visual understanding, long-video processing, and intelligent device control capabilities, making it well-suited for high-end enterprise tasks and complex intelligent interaction scenarios.

  • Leading Visual Understanding Performance: Its performance on image-related tasks rivals that of international flagship models—achieving 74.8 points on MathVista, 96.4 points on DocVQA, 61.5/63.7 points on OCRBench-V2, and 79.8 points on CC-OCR. It accurately parses complex images such as multilingual documents and charts.
  • Deep Long-Video Processing: It supports the understanding of long videos lasting over one hour, precisely capturing events and pinpointing specific segments. With scores of 73.3/79.1 on VideoMME and 47.3 on LVBench, it is ideally suited for professional video analysis applications.
  • Superior Visual Agent Capabilities: As a high-end visual intelligence agent, it enables automated control of computers and mobile phones. It achieves a score of 87.1 on ScreenSpot and boasts a 93.7% accuracy rate in low-complexity Android Control scenarios, allowing dynamic invocation of tools to handle intricate interactions.
  • Peak Multimodal Collaboration: It deeply integrates text and visual capabilities, supporting structured output (such as invoices and table data), making it ideal for specialized scenarios like finance and business. It excels in optimizing Chinese-language interactions.

───────────────────────────────────────────────────────────────────

Core Capabilities

🖼️ Ultimate Image Parsing: It processes complex multilingual documents, charts, and professional images, extracting highly accurate structured data. Its performance on mathematical vision tasks is top-tier, perfectly meeting enterprise-level document digitization needs.

🎬 Precise Capture of Long-Video Events: It understands the temporal logic of long videos, locates key segments, and generates detailed descriptions and question-and-answer responses, making it ideal for video surveillance analysis and professional content summarization scenarios.

🤖 High-End Device Control Agent: Based on visual environments and instructions, it automates operations on computers and mobile phones, enabling complex interface interactions and boosting the efficiency of enterprise smart-device management.

📊 Professional Structured Data Generation: It converts images such as invoices and reports into standardized data formats, seamlessly integrating with enterprise data systems and catering to specialized scenarios like financial accounting and business data analysis.

🌍 Full-Scenario High-End Adaptation: It supports multiple input formats, including local files, URLs, and base64-encoded data, balancing complex enterprise tasks with high-end personal intelligent interactions to meet diverse and sophisticated demands.

Playground

Log in to explore more features! Click to Log In

API Analytics

API Reference (1)

API DescriptionAPI EndpointRequest MethodStabilityParameter Description
Chat(SophNet)
POST
Stable
View Details

API Pricing

$
ModelDescriptionContextOfficial Price302.AI Price

sophnet/Qwen2.5-VL-72B-Instruct

-
128000

Input$2.29 / 1M tokens
Output$6.86 / 1M tokens

Input$2.29/ 1M tokens
Output$6.86/ 1M tokens
Original Price