qwen/qwen3-vl-8b-instruct

qwen/qwen3-vl-8b-instruct

Alibaba’s 8-billion-parameter open-source vision-language model
2025-10-21
LLM
Model capability: imageModel capability: videoModel capability: function_call
Input:
$0.072/1M tokens
Output:
$0.286/1M tokens
Bulk order? Contact your manager for exclusive deals

API Overview

Qwen3-VL-8B-Instruct is an 8-billion-parameter open-source vision-language model launched by Alibaba. Its core mission is to serve as a high-performance, lightweight multimodal reasoning engine that strikes the perfect balance between visual accuracy and textual robustness.

  • Performance That Defies Expectations: In benchmarks such as STEM tasks and visual question answering, Qwen3-VL-8B-Instruct outperforms Gemini 2.5 Flash Lite and GPT-5 Nano, even rivaling its predecessor, Qwen2.5-VL-72B, which boasts 72 billion parameters.
  • Lower Memory Footprint: Compared to large models, it significantly reduces hardware requirements—single RTX 4090 GPUs can handle smooth inference—and supports quantized versions like FP8, making it adaptable to various devices.
  • Deployment Across All Scenarios: It supports mixed image and text inputs and can be deployed locally on terminals, covering desktops, mobile devices, and intelligent agent environments.
  • Enhanced Video Understanding: It natively supports a 256K context window (expandable up to 1M), enabling it to process videos lasting several hours and achieving second-level timestamp indexing.

───────────────────────────────────────────────────────────────────

Core Capabilities

🚀 Exceptional Long-Text and Video Processing

Natively supporting a 256K context window, combined with Interleaved-MRoPE positional encoding, it effortlessly handles entire books or videos lasting several hours, enabling precise temporal localization.

👁️ Superior Visual and Spatial Perception

Empowered by DeepStack technology, it enhances detail capture and supports both 2D and 3D spatial positioning, accurately determining object locations and occlusion relationships, thereby enabling robot navigation and AR applications.

📝 Professional-Grade OCR and Document Understanding

It supports OCR recognition in 32 languages, achieving a 92% accuracy rate for low-light, blurry, and skewed texts, and can also parse complex charts and document layouts.

🛡️ Secure and Controllable Interaction Experience

Equipped with a built-in multimodal content filtering system, it achieves over 94% success rate in blocking harmful content while maintaining a high response rate to legitimate user commands, ensuring application safety.

Playground

Log in to explore more features! Click to Log In

API Analytics

API Reference (1)

API DescriptionAPI EndpointRequest MethodStabilityParameter Description
Chat(PPIO)
POST
Stable
View Details

API Pricing

$
ModelDescriptionContextOfficial Price302.AI Price

qwen/qwen3-vl-8b-instruct

-
131027

Input$0.072 / 1M tokens
Output$0.286 / 1M tokens

Input$0.072/ 1M tokens
Output$0.286/ 1M tokens
Original Price