DeepSeek-OCR 2: Visual Causal Flow

arXiv: 2601.20552 · 2026-01-28

Summary

DeepSeek-OCR 2 introduces the concept of a visual causal flow: its DeepEncoder V2 dynamically reorders visual tokens, exploring a new paradigm for 2D image understanding. Unlike conventional OCR approaches, the model captures causal relationships among the elements of an image, enabling more accurate text recognition and scene understanding, and it significantly surpasses existing techniques on tasks such as document analysis, table recognition, and handwriting recognition.

Abstract

Haoran Wei, Yaofeng Sun, Yukun Li (DeepSeek-AI)

We present DeepSeek-OCR 2 to investigate the feasibility of a novel encoder, DeepEncoder V2, capable of dynamically reordering visual tokens according to image semantics. Conventional vision-language models (VLMs) invariably process visual tokens in a rigid raster-scan order (top-left to bottom-right) with fixed positional encoding when they are fed into LLMs. However, this contradicts human visual perception, which follows flexible yet semantically coherent scanning patterns driven by inherent logical structures. Particularly for images with complex layouts, […] Inspired by human reading patterns, we design a causality-aware encoder that implicitly reorders visual information through learnable query tokens. Our method achieves significant OCR performance gains while maintaining a high compression ratio.

1 Introduction

The human visual system closely mirrors transformer-based vision encoders [dosovitskiy2020image, dehghani2023patch]: foveal fixations function as visual tokens, locally sharp yet globally aware. However, unlike existing encoders that rigidly scan tokens from top-left to bottom-right, human vision follows a causally driven flow guided by semantic understanding. Consider tracing a spiral: our eye movements follow an inherent logic in which each subsequent fixation causally depends on the previous ones. By analogy, visual tokens in models should be processed selectively, with an ordering highly contingent on visual semantics rather than spatial coordinates. This insight motivates us to fundamentally reconsider the architectural design of VLMs, particularly the encoder component. LLMs are inherently trained on 1D sequential data, while images are 2D structures; directly flattening image patches in a predefined raster-scan order introduces an unwarranted inductive bias that ignores […]

[…] prepended as a prefix: through a customized attention mask, the visual tokens maintain global receptive fields, while the causal flow tokens acquire the ability to reorder visual tokens; (3) we maintain equal cardinality between causal and visual tokens (with redundancy such as padding and borders) to provide sufficient capacity for re-fixation; (4) only the causal flow tokens, the latter half of the encoder outputs, are fed to the LLM [deepseekv2] decoder, enabling cascaded, causality-aware visual understanding.

Second, leveraging DeepEncoder V2, we present DeepSeek-OCR 2, which preserves the image compression ratio and decoding efficiency of DeepSeek-OCR while achieving substantial performance improvements. We constrain the number of visual tokens fed to the LLM to between 256 and 1120: the lower bound (256) corresponds to DeepSeek-OCR's tokenization of 1024×1024 images, while the upper bound (1120) matches Gemini-3 Pro's [team2023gemini] maximum visual token budget. This design positions DeepSeek-OCR 2 as both a […] On OmniDocBench, DeepSeek-OCR 2 reaches 91.09% while using the fewest visual tokens.

[…] considerable advances in visual reading logic.

Figure 2: Two computer-vision models with parallelized queries: DETR's decoder [carion2020end] for object detection and BLIP-2's Q-Former [li2023blip] for visual token compression. Both employ bidirectional self-attention among the queries.

2 Related Works

2.1 Parallelized Queries in Decoder

DETR [carion2020end] pioneered the integration of the transformer architecture into object detection, fundamentally breaking away from traditional detection paradigms [ren2015faster, redmon2017yolo9000]. To overcome the efficiency limitations of serial decoding in transformer blocks, DETR introduced preset, parallelized learnable queries: a set of 100 object queries that encode object priors such as shape and position through training. These queries interact with feature maps [he2016deep] via cross-attention while simultaneously engaging in bidirectional information exchange among themselves through self-attention. […]

2.2 Parallelized Queries in Projector

In recent years, vision-language models [li2023blip, Qwen-VL, Qwen2.5-VL, wei2024vary] have developed rapidly, with architectures converging toward the encoder-projector-LLM paradigm. The projector aligns visual tokens with the LLM's embedding space, serving as the critical bridge that enables LLMs to understand visual content. The Q-Former, introduced in BLIP-2 [li2023blip], exemplifies an effective projector design that employs learnable queries for visual token compression. Adopting a BERT-like [devlin2019bert] architecture and drawing inspiration from DETR's object queries [carion2020end], […] also for token compression in multimodal alignment. A sketch of this shared query mechanism follows.
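
Both DETR's object queries and the Q-Former rely on the same pattern: a fixed set of learnable queries that self-attend bidirectionally and cross-attend to visual features. The minimal sketch below is our illustration, not the official DETR or BLIP-2 code; dimensions and names are placeholders. Note the absence of any causal mask among the queries, which is exactly what DeepEncoder V2 changes.

```python
import torch
import torch.nn as nn

class QueryProjectorLayer(nn.Module):
    """One Q-Former/DETR-style layer: learnable queries compress visual
    features via cross-attention and exchange information bidirectionally
    via self-attention. Illustrative sketch only."""

    def __init__(self, dim: int = 256, n_heads: int = 8, n_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))  # learned priors
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, N, dim) flattened image features
        q = self.queries.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        q = q + self.self_attn(q, q, q, need_weights=False)[0]   # bidirectional: no mask
        q = q + self.cross_attn(q, visual_feats, visual_feats,
                                need_weights=False)[0]           # read visual features
        return q + self.ffn(q)                                   # (B, n_queries, dim)

feats = torch.randn(2, 196, 256)            # e.g. a 14x14 grid, flattened
print(QueryProjectorLayer()(feats).shape)   # torch.Size([2, 32, 256])
```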

2.3 LLM-based Multimodal Initialization

LLMs trained on large-scale internet data have proven effective as initialization for multimodal models. Pang et al. [pang2023fozen] demonstrated that frozen LLM transformer layers enhance visual discriminative tasks. Moreover, encoder-free or lightweight-encoder models such as Fuyu [fuyu8b_model] and Chameleon [chameleon2024] in vision, as well as VALL-E [wang2023neural] in speech, further validate the potential of LLM pretrained weights for multimodal initialization.

Figure 3: DeepSeek-OCR 2 adopts the visual token compression mechanism from DeepEncoder, employing an 80M-parameter […]

3 Methodology

3.1 Architecture

As shown in Figure 3, DeepSeek-OCR 2 inherits the overall architecture of DeepSeek-OCR, which consists of an encoder and a decoder. The encoder discretizes images into visual tokens, while the decoder generates outputs conditioned on these visual tokens and text prompts. The key distinction lies in the encoder: we upgrade DeepEncoder to DeepEncoder V2, which retains all capabilities of its predecessor while introducing causal reasoning through a novel architectural design. We elaborate on the details of DeepSeek-OCR 2 in the following sections.

3.2 DeepEncoder V2

The vanilla encoder serves as an important component that extracts and compresses image features through attention mechanisms in which each token attends to all others, achieving full-image receptive fields analogous to human foveal and peripheral vision. However, flattening 2D image patches into a 1D sequence imposes a rigid ordering bias through text-oriented positional encodings (e.g., RoPE [su2021roformer]). This contradicts natural visual reading patterns, especially the non-linear layouts of optical text, forms, and tables.

Figure 4: Token count calculation in DeepEncoder V2. DeepEncoder V2 outputs 256 to 1120 tokens per image using a multi-crop strategy with 0 to 6 local views. With 0 local views, only the global view produces 256 tokens; with 6 local views, the count reaches 1120 (6×144+256).

3.2.1 Vision tokenizer

The first component of DeepEncoder V2 is a vision tokenizer. Following DeepEncoder, we employ an architecture combining an 80M-parameter SAM-base [kirillov2023segment] with two convolutional layers […, li2024small, huang2026step3] through window attention with minimal parameters, significantly reducing both the computational cost and the activation memory of the subsequent global-attention module. Moreover, its parameter count (80M) remains comparable to the typical 100M parameters used for text input embeddings in LLMs.

3.2.2 Language model as vision encoder

In DeepEncoder, a CLIP ViT follows the vision tokenizer to compress visual knowledge. DeepEncoder V2 redesigns this component into an LLM-style architecture with a dual-stream attention mechanism: visual tokens use bidirectional attention to preserve CLIP's global modeling capability, while the newly introduced causal flow queries employ causal attention. These learnable queries are appended after the visual tokens as a suffix, where each query attends to all visual tokens and to the preceding queries. By maintaining equal cardinality between queries and visual tokens, this design imposes semantic ordering and distillation on the visual features without altering the token count. Finally, only the causal query outputs are fed to the LLM decoder.

We instantiate this architecture with Qwen2-0.5B [wang2024qwen2], whose 500M parameters are comparable to CLIP ViT (300M) and introduce no excessive computational overhead. The decoder-only architecture with prefix-concatenation of visual tokens proves crucial: extra experiments with cross-attention in an mBART-style [liu2020multilingual] encoder-decoder structure fail to converge. We hypothesize that this failure stems from insufficient visual-token interaction when the visual tokens are isolated in a separate encoder. In contrast, the prefix design keeps […] Unlike encoders that impose rigid spatial ordering through positional encodings, our causally ordered queries adapt to smooth visual semantics while naturally aligning with the LLM's unidirectional attention pattern. This design may bridge the gap between 2D spatial structure and 1D causal language modeling.

Figure 5: Attention mask architecture of DeepEncoder V2: the concatenation of a bidirectional mask (vision tokens, ViT-like) and a causal triangular mask (flow tokens, LLM decoder-style).

3.2.3 Causal flow query

As aforementioned, the number of causal query tokens equals the number of visual tokens, computed as W×H/(16²×16), where W and H denote the width and height of the image fed to the encoder. To avoid maintaining multiple query sets for different resolutions, we adopt a multi-crop strategy with fixed query configurations at predefined resolutions. Specifically, the global view uses a resolution of 1024×1024, corresponding to 256 query embeddings denoted query_global. Local crops adopt a resolution of 768×768, with the number of crops k ranging from 0 to 6 (no cropping is applied when both image dimensions are smaller than 768). All local views share a unified set of 144 query embeddings, denoted query_local. The total number of reordered visual tokens fed to the LLM is therefore k×144+256, ranging over [256, 1120]. This maximum token count (1120) is lower than DeepSeek-OCR's 1156 (Gundam mode) and matches Gemini-3-Pro's maximum visual token budget.
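
The token budget follows directly from the tokenizer's 16²×16 reduction. A quick check of the arithmetic (helper names are ours):

```python
def tokens_per_view(width: int, height: int) -> int:
    """Visual tokens per view: W*H / (16^2 * 16), per the formula above."""
    return (width * height) // (16 ** 2 * 16)

def total_tokens(num_local_crops: int) -> int:
    """Reordered visual tokens fed to the LLM for k local crops (0 <= k <= 6)."""
    assert 0 <= num_local_crops <= 6
    return num_local_crops * tokens_per_view(768, 768) + tokens_per_view(1024, 1024)

print(tokens_per_view(1024, 1024))        # 256 (global view)
print(tokens_per_view(768, 768))          # 144 (each local view)
print(total_tokens(0), total_tokens(6))   # 256 1120
```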

[…] bidirectional attention (a full mask, ViT-like) for the vision tokens, and causal attention (a triangular mask, identical to decoder-only LLMs) for the causal flow tokens, where each token attends only to previous tokens. These two components are concatenated along the sequence dimension to construct DeepEncoder V2's attention mask M:

M = \begin{bmatrix} \mathbf{1}_{m\times m} & \mathbf{0}_{m\times n} \\ \mathbf{1}_{n\times m} & \text{LowerTri}(n) \end{bmatrix}, \quad \text{where } n = m \qquad (1)

where n is the number of causal query tokens, m is the number of vanilla visual tokens, and LowerTri(n) denotes an n×n lower-triangular matrix with ones on and below the diagonal and zeros above. […] the causal query tokens are randomly sampled over the LLM vocabulary during training, which encourages the encoder to map visual information into the language space.
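
Equation (1) translates into a few lines of PyTorch. A minimal sketch (ours; we use the boolean convention that True means "may attend"):

```python
import torch

def deepencoder_v2_mask(m: int) -> torch.Tensor:
    """Attention mask M of Eq. (1) with n = m causal flow tokens.
    True = this position may be attended to."""
    n = m
    top = torch.cat([torch.ones(m, m, dtype=torch.bool),     # vision tokens: bidirectional,
                     torch.zeros(m, n, dtype=torch.bool)],   # blind to the flow tokens
                    dim=1)
    bottom = torch.cat([torch.ones(n, m, dtype=torch.bool),  # flow tokens see all vision tokens
                        torch.tril(torch.ones(n, n, dtype=torch.bool))],  # and earlier flow tokens
                       dim=1)
    return torch.cat([top, bottom], dim=0)                   # (m + n, m + n)

print(deepencoder_v2_mask(3).int())
```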

3.3 DeepSeek-MoE Decoder

Since DeepSeek-OCR 2 focuses primarily on encoder improvements, we do not upgrade the decoder component. Following this design principle, we retain DeepSeek-OCR's decoder: a 3B-parameter MoE structure with about 500M active parameters. The core forward pass of DeepSeek-OCR 2 can be formulated as

\mathbf{O} = \mathcal{D}\left(\pi_{Q}\left(\mathcal{T}^{L}\left(\mathcal{E}(\mathbf{I}) \oplus \mathbf{Q}_{0};\, \mathbf{M}\right)\right)\right) \qquad (2)

where \mathbf{I} \in \mathbb{R}^{H\times W\times 3} is the input image and \mathcal{E} is the […]
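
Equation (2) maps onto the following toy sketch, where the component roles (vision tokenizer E, initial flow queries Q0, masked L-layer encoder T^L, query selection pi_Q, decoder D) are inferred from the surrounding text; the modules are tiny stand-ins, not the released implementation, and deepencoder_v2_mask is the helper from the Eq. (1) sketch above:

```python
import torch
import torch.nn as nn

class CascadeSketch(nn.Module):
    """Toy stand-in for the Eq. (2) pipeline; real components are far larger."""

    def __init__(self, d: int = 64, patch_dim: int = 768, n_layers: int = 2):
        super().__init__()
        self.tokenizer = nn.Linear(patch_dim, d)      # stand-in for E (SAM + conv tokenizer)
        self.encoder = nn.ModuleList([nn.TransformerEncoderLayer(d, 4, batch_first=True)
                                      for _ in range(n_layers)])  # stand-in for T^L
        self.q0 = nn.Parameter(torch.randn(16, d))    # Q0: initial flow queries (n = m = 16 here)
        self.decoder = nn.Linear(d, 100)              # stand-in for the MoE decoder D

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        v = self.tokenizer(patches)                                   # E(I): (B, m, d)
        x = torch.cat([v, self.q0.expand(v.size(0), -1, -1)], dim=1)  # E(I) (+) Q0
        blocked = ~deepencoder_v2_mask(v.size(1))                     # True = may NOT attend
        for layer in self.encoder:                                    # T^L under mask M
            x = layer(x, src_mask=blocked)
        return self.decoder(x[:, v.size(1):])                         # pi_Q, then D

print(CascadeSketch()(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 100])
```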

4 Experimental Settings

4.1 Data Engine

DeepSeek-OCR 2 employs the same data sources as DeepSeek-OCR, comprising OCR 1.0, OCR 2.0 [chen2024onechart, wei2024slow, liu2024focus_fox], and general vision data [wei2025deepseek], with OCR data constituting 80% of the training mixture. We introduce two modifications: (1) a more balanced sampling strategy for OCR 1.0 data, partitioning pages by content type (text, formulas, tables) in a 3:1:1 ratio, and (2) label refinement for layout detection, merging semantically similar categories (e.g., unifying "figure caption" and "figure title"). Given these minimal differences, we […]
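
For illustration, the stated ratios translate into sampling weights as follows. The split between OCR 1.0 and OCR 2.0 within the 80% OCR share is not given in the text, so the sketch parameterizes it with an assumed value:

```python
def mixture_weights(ocr_share: float = 0.8, ocr1_frac: float = 0.5) -> dict:
    """Sampling weights implied by the stated ratios. `ocr1_frac` (OCR 1.0's
    share of the OCR data) is our assumption, not a number from the paper."""
    ocr1 = ocr_share * ocr1_frac
    text, formulas, tables = (ocr1 * r / 5 for r in (3, 1, 1))  # 3:1:1 partition
    return {
        "ocr1_text": text,
        "ocr1_formulas": formulas,
        "ocr1_tables": tables,
        "ocr2": ocr_share * (1 - ocr1_frac),
        "general_vision": 1.0 - ocr_share,
    }

weights = mixture_weights()
print(weights, sum(weights.values()))  # weights sum to 1.0
```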

4.2 Training Pipelines

We train DeepSeek-OCR 2 in three stages: (1) encoder pretraining, (2) query enhancement, and (3) decoder specialization. Stage 1 enables the vision tokenizer and the LLM-style encoder to acquire fundamental capabilities in feature extraction, token compression, and token reordering. Stage 2 further strengthens the encoder's token-reordering capability while enhancing visual knowledge compression. Stage 3 freezes the encoder parameters and optimizes only the decoder, enabling higher data throughput under the same FLOPs.

4.2.1 Training DeepEncoder V2

Following DeepSeek-OCR, […] (about 100M image-text pair samples).

4.2.2 Query enhancement

After DeepEncoder V2 pretraining, we integrate it with DeepSeek-3B-A500M [deepseekv2, deepseekv3] as our final pipeline. We freeze the visual tokenizer (the SAM-conv structure) while jointly optimizing the LLM encoder and the LLM decoder to enhance the query representations. At this stage, we unify the two resolutions into a single dataloader via the multi-crop strategy. We adopt 4-stage pipeline parallelism: the vision tokenizer (PP0), the LLM-style encoder (PP1), and the DeepSeek-LLM layers (6 layers per stage on PP2-3). With 160 GPUs (40GB per GPU), we […] using the same optimizer and a learning-rate decay from 5e-5 to 1e-6 over 15k iterations.

4.2.3 Continue-training LLM

To consume training data rapidly, we freeze all DeepEncoder V2 parameters in this stage and update only the DeepSeek-LLM parameters. This accelerates training (more than doubling the speed under the same global batch size) while helping the LLM better understand DeepEncoder V2's reordered visual tokens. Continuing from stage 2, we perform another learning-rate decay, from 1e-6 to 5e-8, over 20k iterations.
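
The two decay schedules above (5e-5 to 1e-6 over 15k iterations, then 1e-6 to 5e-8 over 20k) are easy to sanity-check numerically. The excerpt does not state the decay shape, so the sketch below assumes a cosine curve:

```python
import math

def lr_at(step: int, total_steps: int, lr_start: float, lr_end: float) -> float:
    """Cosine decay from lr_start to lr_end; the cosine shape is our assumption."""
    t = min(step, total_steps) / total_steps
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * t))

# Stage 2: 5e-5 -> 1e-6 over 15k iterations.
print(lr_at(0, 15_000, 5e-5, 1e-6), lr_at(15_000, 15_000, 5e-5, 1e-6))
# Stage 3: 1e-6 -> 5e-8 over 20k iterations.
print(lr_at(0, 20_000, 1e-6, 5e-8), lr_at(20_000, 20_000, 1e-6, 5e-8))
```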

Table 1: Comprehensive evaluation of do[…] The rows below are the portion recoverable from the source; the metric headers are reconstructed from the OmniDocBench v1.5 protocol and should be checked against the original paper, and the DeepSeek-OCR / DeepSeek-OCR 2 rows did not survive extraction.

| Model | V-token max | Overall↑ | Text-ED↓ | Formula↑ | Table↑ | Table-S↑ | Order-ED↓ |
|---|---|---|---|---|---|---|---|
| Dolphin [dolphin] | - | 83.21 | 0.092 | 80.78 | 78.06 | 84.10 | 0.080 |
| PP-StructureV3 [cui2025paddleocr] | - | 86.73 | 0.073 | 85.79 | 81.68 | 89.48 | 0.073 |
| MonkeyOCR-pro-1.2B [li2025monkeyocr] | - | 86.96 | 0.084 | 85.02 | 84.24 | 89.02 | 0.130 |
| MonkeyOCR-3B [li2025monkeyocr] | - | 87.13 | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 |
| MonkeyOCR-pro-3B [li2025monkeyocr] | - | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
| MinerU2.5 [wang2024mineru] | - | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |
| PaddleOCR-VL [cui2025paddleocrvl] | - | 92.86 | 0.035 | 91.22 | 90.89 | 94.76 | 0.043 |
| End-to-end models: | | | | | | | |
| OCRFlux [ocrflux] | >6000 | 74.82 | 0.193 | 68.03 | 75.75 | 80.23 | 0.202 |
| GPT-4o [GPT4] | - | 75.02 | 0.217 | 79.70 | 67.… | … | … |
| S-Reader [liu2025pointsreader] | >6000 | 80.98 | 0.134 | 79.20 | 77.13 | 81.66 | 0.145 |
| olmOCR [poznanski2025olmocr] | >6000 | 81.79 | 0.096 | 86.04 | 68.92 | 74.77 | 0.121 |
| InternVL3.5-241B [wang2025internvl35] | >7000 | 82.67 | 0.142 | 87.23 | 75.00 | 81.28 | 0.125 |
| MinerU2-VLM [wang2024mineru] | >7000 | 85.56 | 0.078 | 80.95 | 83.54 | 87.66 | 0.086 |
| Nanonets-OCR-s [NanonetsOCRs] | >7000 | 85.59 | 0.093 | 85.90 | 80.14 | 85.57 | 0.108 |
| Qwen2.5-VL-72B [Qwen2.5-VL] | >6000 | 87.02 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 |
| Gemini-2.5 Pro [google_gemini_web] | - | 88.03 | 0.075 | 85.82 | 85.71 | 90.29 | 0.097 |
| dots.ocr [dots] | >6000 | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| OCRVerse [OCRVe…] | … | … | … | … | … | … | … |

5 Evaluation

We select OmniDocBench v1.5 [ouyang2025omnidocbench] as the primary benchmark for evaluation. It comprises 1,355 document pages spanning 9 major categories (including magazines, academic papers, and research reports) in both Chinese and English. With its diverse test samples and robust evaluation criteria, OmniDocBench provides an effective framework for validating the performance of DeepSeek-OCR 2, and in particular the effectiveness of DeepEncoder V2.

5.1 Main Results

As shown in Table 1, DeepSeek-OCR 2 achieves an advanced overall score of 91.09% while using the smallest upper limit of visual tokens (V-token max). Compared to the DeepSeek-OCR baseline, it demonstrates a 3.73% improvement under similar training data sources, validating the effectiveness of the newly designed architecture. Beyond the overall improvement, the Edit Distance (ED) for reading order (R-order) has also decreased significantly, from 0.085 to 0.057, indicating that the new DeepEncoder V2 can effectively select and arrange initial visual tokens based on image information. As illustrated in […]
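
The ED metric here is the usual normalized Levenshtein edit distance. A minimal reference implementation (ours; OmniDocBench's exact tokenization and normalization may differ):

```python
def edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance normalized by reference length; lower is better."""
    m, n = len(pred), len(ref)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                                 # deletion
                         cur[j - 1] + 1,                              # insertion
                         prev[j - 1] + (pred[i - 1] != ref[j - 1]))   # substitution
        prev = cur
    return prev[n] / max(n, 1)

print(edit_distance("DeepSeek-OCR 2", "DeepSeek-OCR 2"))  # 0.0 (exact match)
print(edit_distance("DeepSeekOCR", "DeepSeek-OCR"))       # ~0.083 (one missing '-')
```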

5.2 Improvement Headroom

We conduct a detailed performance comparison between DeepSeek-OCR and DeepSeek-OCR 2 across the 9 document types and find that DeepSeek-OCR 2 still has considerable room for improvement, as shown in Table 3. On text-recognition Edit Distance (ED), DeepSeek-OCR 2 outperforms DeepSeek-OCR in most cases, but there are also notable weaknesses, such as newspapers, where its ED exceeds 0.13. We see two main reasons: (1) the lower upper limit on visual tokens may hurt recognition on text-super-rich newspapers, which can be simply addressed in the future by increasing the number of local crops; and (2) insufficient newspaper data: our training […]

5.3 Practical Readiness

DeepSeek-OCR serves two primary production use cases: an online OCR service that reads images and documents for DeepSeek LLMs, and a pretraining data pipeline that performs batch PDF processing. We compare the production performance of DeepSeek-OCR 2 against DeepSeek-OCR. Since ground truth is unavailable in production environments, we focus primarily on the repetition rate as the key metric. As shown in Table 4, DeepSeek-OCR 2 demonstrates markedly improved practical readiness over its predecessor, reducing the repetition rate from 6.25% to 4.17% on online user-log images and from 3.69% to 2.88% in PDF data production. These results further validate […]
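
The excerpt does not define how the repetition rate is computed. One plausible proxy (hypothetical, not the authors' metric) flags an output as repetitive when some n-gram repeats consecutively, the classic degeneration mode of autoregressive OCR decoders:

```python
def is_degenerate(text: str, n: int = 20, repeats: int = 3) -> bool:
    """Hypothetical detector: True if some n-gram of tokens occurs
    `repeats` times in a row. Not the paper's actual definition."""
    toks = text.split()
    for i in range(len(toks) - n * repeats + 1):
        gram = toks[i:i + n]
        if all(toks[i + r * n:i + (r + 1) * n] == gram for r in range(1, repeats)):
            return True
    return False

def repetition_rate(outputs: list[str]) -> float:
    """Fraction of model outputs flagged as repetitive."""
    return sum(map(is_degenerate, outputs)) / max(len(outputs), 1)

print(repetition_rate(["normal page text", "loop " * 100]))  # 0.5
```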

6 Discussion and Future Works

6.1 Towards Genuine 2D Reasoning

DeepSeek-OCR 2 presents a novel architectural paradigm: an LLM-style encoder cascaded with an LLM decoder. This cascade of two 1D causal reasoners holds promise for genuine 2D reasoning: the encoder performs reading-logic reasoning (causally reordering visual information through query tokens), while the decoder executes visual-task reasoning over these causally ordered representations. Decomposing 2D understanding into two complementary, orthogonal 1D causal-reasoning subtasks may represent a breakthrough toward genuine 2D reasoning. Of course, achieving this goal remains a long journey. […]

6.2 Towards Native Multimodality

DeepEncoder V2 provides initial validation of the LLM-style encoder's viability for visual tasks. More importantly, this architecture has the potential to evolve into a unified omni-modal encoder: a single encoder with shared W_k and W_v projections, attention mechanisms, and FFNs could process multiple modalities through modality-specific learnable query embeddings. Such an encoder could compress text, extract speech features, and reorganize visual content within the same parameter space, differing only in the learned weights of the query embeddings. DeepSeek-OCR's optical compression […]

7 Conclusion

In this technical report, we present DeepSeek-OCR 2, a significant upgrade to DeepSeek-OCR that maintains high visual-token compression while achieving meaningful performance improvements. This advancement is powered by the newly proposed DeepEncoder V2, which implicitly distills causal understanding of the visual world through the integration of bidirectional and causal attention mechanisms, yielding causal-reasoning capabilities in the vision encoder and, consequently, marked improvements in visual reading logic. While optical text reading, particularly document parsing, represents one of […]