
deepseek/deepseek-ocr-2
API Overview
DeepSeek-OCR 2 is a next-generation multimodal document recognition model from DeepSeek AI. Its stated goal is “achieving high-precision document structure understanding through semantic causal scanning”: instead of encoding visual tokens in a fixed grid order, it reorders them dynamically according to the logic of the document.
- Key Upgrades: Introduces DeepEncoder V2, replacing the CLIP encoder with the lightweight LLM Qwen2-0.5B, and adds a causal flow query mechanism to enable dynamic reordering of visual tokens.
- Applicable Scenarios: Parsing documents with complex layouts, such as PDFs or scans containing formulas, tables, and multi-column text. It is particularly suited to cases where the reading order must be restored.
- Product Value: Compresses the content of an entire page into only 256–1120 visual tokens with high fidelity, achieving token efficiency comparable to Gemini-3 Pro.
- Test Data: OmniDocBench v1.5 overall score: 91.09%, an improvement of 3.73% over the previous generation; reading order edit distance reduced from 0.085 to 0.057.
- Real-World Performance: Online service duplication rate dropped from 6.25% to 4.17%, and PDF data processing duplication rate decreased from 3.69% to 2.88%.
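The before/after figures in the list above are easier to compare as relative changes. The short sketch below only redoes that arithmetic from the quoted numbers. The helper name `relative_reduction` is made up for illustration, and the last line assumes that the 3.73% gain is an absolute difference in score, which this page does not state explicitly.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the relative drop from `before` to `after` as a percentage."""
    return (before - after) / before * 100.0


# Figures quoted in the list above: (previous generation, DeepSeek-OCR 2).
metrics = {
    "reading-order edit distance": (0.085, 0.057),
    "online duplication rate (%)": (6.25, 4.17),
    "PDF duplication rate (%)": (3.69, 2.88),
}

for name, (before, after) in metrics.items():
    drop = relative_reduction(before, after)
    print(f"{name}: {before} -> {after} ({drop:.1f}% lower)")

# Assumption: the 3.73% improvement is an absolute difference in overall score.
previous_overall = round(91.09 - 3.73, 2)
print(f"implied previous overall score: {previous_overall}%")
```

Under that reading, the edit distance and the online duplication rate each fall by roughly a third, while the PDF duplication rate falls by about 22%.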
───────────────────────────────────────────────────────────────────
Core Capabilities
🔑 Causal Visual Encoding
Learnable causal flow query tokens mimic the way human gaze jumps across a page, dynamically reordering visual information to match its semantic logic.
⚡ LLM-style Visual Backbone
Adopts Qwen2-0.5B as the encoder, inheriting the efficient inference of language models, with about 500M parameters in total.
🧩 Mixed Attention Mechanism
Visual tokens use bidirectional attention to maintain global perception, while query tokens employ causal attention to achieve autoregressive reordering.
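The mixed attention described above can be pictured as a single boolean mask. The sketch below is a minimal illustration, not the model's actual implementation. It assumes that visual tokens come first in the sequence, that query tokens follow them, and that each query token may attend to every visual token; the function name `mixed_attention_mask` is hypothetical.

```python
def mixed_attention_mask(num_visual: int, num_query: int) -> list[list[bool]]:
    """Build a mask where mask[i][j] is True if position i may attend to j.

    Assumed layout (not confirmed by the source): visual tokens occupy
    positions 0..num_visual-1 and query tokens follow them.
    """
    total = num_visual + num_query
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_visual:
                # Visual rows: bidirectional attention over visual tokens only.
                mask[i][j] = j < num_visual
            else:
                # Query rows: every visual token, plus queries up to itself.
                mask[i][j] = j < num_visual or j <= i
    return mask


# Print the mask for 3 visual tokens and 2 query tokens.
for row in mixed_attention_mask(3, 2):
    print("".join("1" if allowed else "." for allowed in row))
```

With 3 visual and 2 query tokens, the visual rows form a fully filled 3x3 block and the query rows form a lower triangle to its right, which is what lets the queries emit the reordered sequence one step at a time.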
📊 Two-stage Cascaded Inference
First, the encoder performs the “2D→1D” semantic reordering; then the MoE decoder (3B total parameters, 500M activated) generates the final content.
🔄 Efficient Compression and Training
A SAM-base tokenizer provides 16x compression, and a three-stage training strategy jointly optimizes the encoder and decoder.
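To make the 16x compression concrete, the sketch below counts the tokens that remain after an image is split into patches and the patch grid is compressed. Every number other than the 16x factor is an assumption that this page does not confirm: a 16-pixel patch size, a 1024×1024 global view, and up to six 768×768 local crops. They were chosen because they reproduce the 256–1120 visual-token range quoted earlier.

```python
def visual_tokens(side_px: int, patch_px: int = 16, compression: int = 16) -> int:
    """Tokens left after patchifying a square image and compressing the grid.

    patch_px is an assumed patch size; compression is the 16x factor
    quoted in the text.
    """
    patches_per_side = side_px // patch_px
    return (patches_per_side * patches_per_side) // compression


# Assumed resolutions: one 1024px global view and up to six 768px local crops.
global_view = visual_tokens(1024)  # 64 * 64 = 4096 patches before compression
local_crop = visual_tokens(768)    # 48 * 48 = 2304 patches before compression

min_tokens = global_view
max_tokens = global_view + 6 * local_crop
print(min_tokens, local_crop, max_tokens)
```

Under these assumptions a single global view costs 256 tokens and each extra crop adds 144, so a page with six crops reaches 1120.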
───────────────────────────────────────────────────────────────────
Principle Breakdown