
sophnet/Qwen2.5-VL-32B-Instruct
API Overview
Qwen2.5-VL-32B-Instruct is Alibaba Cloud Tongyi’s flagship multimodal model, primarily positioned as a “full-scenario visual agent and expert in text-image-video collaboration.” With a massive 32-billion-parameter scale, it delivers state-of-the-art visual understanding, long-video processing, and device-control capabilities, making it well-suited for enterprise-level complex tasks and high-end intelligent interaction scenarios.
- State-of-the-art visual understanding performance: Outstanding performance in image-related tasks—achieving 74.7 points on MathVista, 40.0 points on MathVision, 57.2/59.1 points on OCRBenchV2, 77.1 points on CC-OCR (supporting multilingual document parsing), and 94.8 points on DocVQA, meeting the demands of professional document processing.
- Deep processing of long videos: Supports understanding of videos longer than one hour, enabling precise identification of event segments. With scores of 70.5/77.9 on VideoMME and 1.93 on MMBench-Video, it is ideal for video content analysis and key information extraction.
- Strong visual-agent capabilities: As a visual intelligence agent, it supports computer and mobile phone control. In low-complexity Android Control scenarios, it achieves an accuracy rate of 93.3%, and ScreenSpot scores 88.5 points, allowing dynamic tool invocation for automated device operations.
- Enhanced multimodal collaboration: Outstanding text capabilities (78.4 points on MMLU, 91.5 points on Human Eval), supporting cross-modal interactions among text, images, and videos, and generating structured outputs (such as invoices and table data), making it suitable for financial and business scenarios.
───────────────────────────────────────────────────────────────────
Core Capabilities
🖼️ High-precision image parsing: Handles complex images (charts, multilingual documents), extracts structured data, and delivers top-notch performance in mathematical vision tasks, making it ideal for professional document digitization and analysis scenarios.
🎬 Long-video event capture: Understands temporal information in long videos, locates key segments, generates accurate descriptions and answers, and is suited for video surveillance, content summarization, and other similar scenarios.
🤖 Device-control agent: Based on visual environments and instructions, it automates operations on computers and mobile phones, completing interface interactions and function calls, thereby enhancing the efficiency of smart-device usage.
📊 Structured data generation: Converts images such as invoices and tables into standardized data formats, compatible with enterprise data systems, and meets the needs of financial accounting and business data analysis.
🌍 Full-scenario adaptability: Optimized for Chinese-language interactions, supports multiple input formats including local files, URLs, and base64-encoded data, and balances both enterprise-level tasks and high-end personal intelligent interaction needs.
Playground
Log in to explore more features! Click to Log In