qwen3-vl-235b-a22b-instruct

qwen3-vl-235b-a22b-instruct

The flagship multimodal mixture-of-experts (MoE) model launched by the Tongyi Qianwen.
2025-09-24
LLM
Model capability: image
Input:
$0.286/1M tokens
Output:
$1.143/1M tokens
Bulk order? Contact your manager for exclusive deals

API Overview

Qwen3-VL-235B-A22B-Instruct is a super-large-scale multimodal instruction-tuned model released by Alibaba’s Tongyi Lab, primarily positioned as the “most powerful open-source visual-language intelligence foundation model,” specifically designed for complex image-and-text understanding and tool-collaboration scenarios.

  • Flagship MoE Architecture: With a total of 235 billion parameters, it activates only 22 billion parameters, striking a balance between cutting-edge multimodal capabilities and efficient inference costs.
  • Natively Supports Long Videos and Documents: It can handle various types of inputs, including images, videos, PDFs, web page screenshots, and more, supporting ultra-long context fusion and analysis in a single inference.
  • Comprehensively Enhanced Agent Capabilities: It leads in tasks such as GUI operations, frontend code generation, and chart-based question answering, and supports Function Calling and structured output.
  • Built-in Deep Thinking Mode: It can automatically enable Chain-of-Thought reasoning, breaking down complex visual tasks into step-by-step subtasks and solving them sequentially.

───────────────────────────────────────────────────────────────────

Core Capabilities

👁️ Pixel-Level Semantic Understanding: Accurately identifies interface elements, chart data, document layouts, and correlates them with text instructions for high-level reasoning.

🧠 Autonomous Task Planning: When faced with tasks such as “writing e-commerce detail pages based on product images” or “generating analytical reports from financial statement screenshots,” it can automatically plan the process of parsing → extraction → generation.

🌍 Multi-Language Image and Text Generation: It supports image description, interpretation, and creation in multiple languages including Chinese and English, producing outputs that are culturally appropriate and contextually relevant.

🧩 Seamless Agent Collaboration: Natively compatible with tool-call protocols, it can directly drive browsers, code executors, or design software to build end-to-end automated workflows.

Playground

Log in to explore more features! Click to Log In

API Analytics

API Reference (1)

API DescriptionAPI EndpointRequest MethodStabilityParameter Description
Chat (Tongyi Qianwen-OCR)
POST
Stable
View Details

API Pricing

$
ModelDescriptionContextOfficial Price302.AI Price

qwen3-vl-235b-a22b-instruct

-
126976

Input$0.286 / 1M tokens
Output$1.143 / 1M tokens

Input$0.286/ 1M tokens
Output$1.143/ 1M tokens
Original Price