deepseek/deepseek-ocr-2

deepseek/deepseek-ocr-2

DeepSeek AI’s multimodal document recognition model achieves high-precision understanding of document structure through semantic causal scanning.
2026-02-04
LLM
Model capability: image
Input:
$0.031/1M tokens
Output:
$0.031/1M tokens
Bulk order? Contact your manager for exclusive deals

API Overview

DeepSeek-OCR 2 is a next-generation multimodal document recognition product launched by DeepSeek AI. Its core positioning is “achieving high-precision document structure understanding through semantic causal scanning,” and it upgrades visual encoding from a fixed grid order to a logic-driven dynamic reordering.

  • Key Upgrades: Introduces DeepEncoder V2, replacing the CLIP encoder with the lightweight LLM Qwen2-0.5B, and adds a causal flow query mechanism to enable dynamic reordering of visual tokens.
  • Applicable Scenarios: Complex layout document parsing, such as PDFs or scanned documents containing formulas, tables, and multi-column layouts—particularly suitable for scenarios requiring restoration of reading logic.
  • Product Value: With only 256–1120 visual tokens, it can highly-fidelity compress entire page content, achieving token efficiency comparable to Gemini-3 Pro.
  • Test Data: OmniDocBench v1.5 overall score: 91.09%, an improvement of 3.73% over the previous generation; reading order edit distance reduced from 0.085 to 0.057.
  • Real-World Performance: Online service duplication rate dropped from 6.25% to 4.17%, and PDF data processing duplication rate decreased from 3.69% to 2.88%.

───────────────────────────────────────────────────────────────────

Core Capabilities

🔑 Causal Visual Encoding:

By using learnable causal flow query tokens, it simulates human gaze jumps, dynamically reordering visual information to match semantic logic.

LLM-style Visual Backbone

Adopts Qwen2-0.5B as the encoder, inheriting the efficient inference capability of language models, with total parameters around 500M.

🧩 Mixed Attention Mechanism

Visual tokens use bidirectional attention to maintain global perception, while query tokens employ causal attention to achieve autoregressive reordering.

📊 Two-stage Cascaded Inference

First, the encoder completes “2D→1D” semantic reordering; then, the MoE decoder (total parameters 3B, 500M activations) generates the final content.

🔄 Efficient Compression and Training

Based on the SAM-base tokenizer, it achieves 16x compression, and a three-stage training strategy jointly optimizes the encoding-decoding process.

───────────────────────────────────────────────────────────────────

Principle Breakdown

Playground

Log in to explore more features! Click to Log In

API Analytics

API Reference (1)

API DescriptionAPI EndpointRequest MethodStabilityParameter Description
Chat(PPIO)
POST
Stable
View Details

API Pricing

$
ModelDescriptionContextOfficial Price302.AI Price

deepseek/deepseek-ocr-2

-
8192

Input$0.031 / 1M tokens
Output$0.031 / 1M tokens

Input$0.031/ 1M tokens
Output$0.031/ 1M tokens
Original Price