sophnet/Qwen2.5-VL-32B-Instruct

sophnet/Qwen2.5-VL-32B-Instruct

High-performance multimodal instruction-finetuned model
2025-07-08
LLM
Model capability: image
Input:
$1.14/1M tokens
Output:
$3.43/1M tokens
Bulk order? Contact your manager for exclusive deals

API Overview

Qwen2.5-VL-32B-Instruct is Alibaba Cloud Tongyi’s flagship multimodal model, primarily positioned as a “full-scenario visual agent and expert in text-image-video collaboration.” With a massive 32-billion-parameter scale, it delivers state-of-the-art visual understanding, long-video processing, and device-control capabilities, making it well-suited for enterprise-level complex tasks and high-end intelligent interaction scenarios.

  • State-of-the-art visual understanding performance: Outstanding performance in image-related tasks—achieving 74.7 points on MathVista, 40.0 points on MathVision, 57.2/59.1 points on OCRBenchV2, 77.1 points on CC-OCR (supporting multilingual document parsing), and 94.8 points on DocVQA, meeting the demands of professional document processing.
  • Deep processing of long videos: Supports understanding of videos longer than one hour, enabling precise identification of event segments. With scores of 70.5/77.9 on VideoMME and 1.93 on MMBench-Video, it is ideal for video content analysis and key information extraction.
  • Strong visual-agent capabilities: As a visual intelligence agent, it supports computer and mobile phone control. In low-complexity Android Control scenarios, it achieves an accuracy rate of 93.3%, and ScreenSpot scores 88.5 points, allowing dynamic tool invocation for automated device operations.
  • Enhanced multimodal collaboration: Outstanding text capabilities (78.4 points on MMLU, 91.5 points on Human Eval), supporting cross-modal interactions among text, images, and videos, and generating structured outputs (such as invoices and table data), making it suitable for financial and business scenarios.

───────────────────────────────────────────────────────────────────

Core Capabilities

🖼️ High-precision image parsing: Handles complex images (charts, multilingual documents), extracts structured data, and delivers top-notch performance in mathematical vision tasks, making it ideal for professional document digitization and analysis scenarios.

🎬 Long-video event capture: Understands temporal information in long videos, locates key segments, generates accurate descriptions and answers, and is suited for video surveillance, content summarization, and other similar scenarios.

🤖 Device-control agent: Based on visual environments and instructions, it automates operations on computers and mobile phones, completing interface interactions and function calls, thereby enhancing the efficiency of smart-device usage.

📊 Structured data generation: Converts images such as invoices and tables into standardized data formats, compatible with enterprise data systems, and meets the needs of financial accounting and business data analysis.

🌍 Full-scenario adaptability: Optimized for Chinese-language interactions, supports multiple input formats including local files, URLs, and base64-encoded data, and balances both enterprise-level tasks and high-end personal intelligent interaction needs.

Playground

Log in to explore more features! Click to Log In

API Analytics

API Reference (1)

API DescriptionAPI EndpointRequest MethodStabilityParameter Description
Chat(SophNet)
POST
Stable
View Details

API Pricing

$
ModelDescriptionContextOfficial Price302.AI Price

sophnet/Qwen2.5-VL-32B-Instruct

-
128000

Input$1.14 / 1M tokens
Output$3.43 / 1M tokens

Input$1.14/ 1M tokens
Output$3.43/ 1M tokens
Original Price