qwen2-vl-72b-instruct

qwen2-vl-72b-instruct

From Tongyi Qianwen, with context expanded to 32k, enhanced image understanding capabilities, and improved ability to recognize multilingual text and handwriting in images.
2024-09-04
LLM
Model capability: image
Input:
$2.29/1M tokens
Output:
$6.86/1M tokens
Bulk order? Contact your manager for exclusive deals

API Overview

Qwen2-VL-72B-Instruct is a high-performance multimodal instruction-tuned model released by Alibaba’s Tongyi Lab. Its core positioning is as a flagship multimodal foundation model that delivers powerful visual-language understanding combined with professional-grade text-and-image interaction.

  • Ultra-large-scale dense architecture: With 72 billion parameters fully activated, it achieves state-of-the-art performance among open-source models in authoritative multimodal benchmarks such as MMMU, MathVista, and DocVQA.
  • Natively supports multimodal inputs: It can directly process complex visual content including images, videos, PDFs, web page screenshots, and more, while seamlessly integrating with long-form text.
  • Long-duration video understanding: It can analyze videos longer than 20 minutes, enabling tasks such as video question answering, content creation, and conversational interactions.
  • Fine-grained alignment with instructions: Trained on high-quality human preference data, it precisely responds to complex instructions involving format control, style imitation, multi-step operations, and more.

───────────────────────────────────────────────────────────────────

Core Capabilities

👁️ Expert-level visual parsing: Accurately understands highly information-dense content such as academic charts, engineering drawings, financial reports, and UI interfaces, and extracts structured data from them.

🧠 Cross-modal deep reasoning: Combines visual and textual contexts to accomplish tasks like “writing debugging steps based on circuit diagrams” or “generating shopping recommendations from product comparison images.”

🌍 Natural multilingual output: Strongly enhances Chinese context understanding while supporting mainstream languages such as English, ensuring outputs are aligned with local cultural norms and professional conventions.

🧩 Agent-ready integration: Natively supports Function Calling and JSON Schema output, allowing seamless integration into AI workflows for automated office applications, AI tutoring, e-commerce shopping guides, and more.

Playground

Log in to explore more features! Click to Log In

API Analytics

API Reference (1)

API DescriptionAPI EndpointRequest MethodStabilityParameter Description
Chat (Tongyi Qianwen)
POST
Stable
View Details

API Pricing

$
ModelDescriptionContextOfficial Price302.AI Price

qwen2-vl-72b-instruct

-
32000

Input$2.29 / 1M tokens
Output$6.86 / 1M tokens

Input$2.29/ 1M tokens
Output$6.86/ 1M tokens
Original Price