[原文]Haoran Wei, Yaofeng Sun, Yukun Li. DeepSeek-AI. Abstract: We present DeepSeek-OCR 2 to investigate the feasibility of a novel encoder, DeepEncoder V2, capable of dynamically reordering visual tokens based on image semantics. Conventional vision-language models (VLMs) invariably process visual tokens in a rigid raster-scan order (top-left to bottom-right) with fixed positional encoding when fed into LLMs. However, this contradicts human visual perception, which follows flexible yet semantically coherent scanning patterns driven by inherent logical structures. Particularly for images with complex layouts, ...
[原文]movements follow inherent logic where each subsequent fixation causally depends on previous ones. By analogy, visual tokens in models should be selectively processed with ordering highly contingent on visual semantics rather than spatial coordinates. This insight motivates us to fundamentally reconsider the architectural design of vision-language models (VLMs), particularly the encoder component. LLMs are inherently trained on 1D sequential data, while images are 2D structures. Directly flattening image patches in a predefined raster-scan order introduces unwarranted inductive bias that ignore...
[原文]utputs—are fed to the LLM [ deepseekv2 ] decoder, enabling cascade causal-aware visual understanding. Second, leveraging DeepEncoder V2, we present DeepSeek-OCR 2, which preserves the image compression ratio and decoding efficiency of DeepSeek-OCR while achieving substantial performance improvements. We constrain visual tokens fed to the LLM between 256 and 1120. The lower bound (256) corresponds to DeepSeek-OCR’s tokenization of 1024×1024 images, while the upper bound (1120) matches Gemini-3 pro’s [ team2023gemini ] maximum visual token budget. This design positions DeepSeek-OCR 2 as both a n...
[原文]ation of transformer architecture into object detection, fundamentally breaking away from traditional detection paradigms [ ren2015faster , redmon2017yolo9000 ] . To overcome the efficiency limitations of serial decoding in transformer blocks, DETR introduced preset parallelized learnable queries—a set of 100 object queries that encode object priors such as shape and position through training. These queries interact with feature maps [ he2016deep ] via cross-attention mechanisms, while simultaneously engaging in bidirectional information exchange among themselves through self-attention. DETR e...
[原文]al reading patterns, especially non-linear layouts in optical texts, forms and tables. Figure 4: Token count calculation in DeepEncoder V2. DeepEncoder V2 outputs 256 to 1120 tokens per image using a multi-crop strategy with 0 to 6 local views. With 0 local views, only the global view produces 256 tokens; with 6 local views, the count reaches 1120 ($6\times 144+256$). 3.2.1 Vision tokenizer The first component of DeepEncoder V2 is a vision tokenizer. Following DeepEncoder, we employ an architecture combining an 80M-parameter SAM-base [ kirilloV2023segment ] along with two convolutional lay...
[原文]only the causal query outputs are fed to the LLM decoder. We instantiate this architecture using Qwen2-0.5B [ wang2024qwen2 ] , whose 500M parameters are comparable to CLIP ViT (300M) without introducing excessive computational overhead. The decoder-only architecture with prefix-concatenation of visual tokens proves crucial: extra experiments with cross-attention in an mBART-style [ liu2020multilingual ] encoder-decoder structure fail to converge. We hypothesize this failure stems from insufficient visual token interaction when isolated in a separate encoder. In contrast, the prefix design kee...
[原文]$\text{query}_{\text{global}}$. Local crops adopt a resolution of $768\times 768$, with the number of crops $k$ ranging from 0 to 6 (no cropping is applied when both image dimensions are smaller than 768). All local views share a unified set of 144 query embeddings, denoted as $\text{query}_{\text{local}}$. Therefore, the total number of reordered visual tokens fed to the LLM is $k\times 144+256$, which lies in the range $[256,1120]$. This maximum token count (1120) is lower than DeepSeek-OCR’s 1156 (Gundam mode) and matches Gemini-3-Pro’s maximum visu...
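The token arithmetic in this passage can be checked with a short sketch. It assumes the per-view formula quoted in the text, $\frac{W\times H}{16^{2}\times 16}$, and the stated crop range of 0 to 6; the function and constant names are illustrative and do not come from the DeepSeek-OCR 2 code.

```python
def tokens_for_view(width: int, height: int, patch: int = 16, compression: int = 16) -> int:
    """Query tokens for one view: (W * H) / (patch^2 * compression)."""
    area = width * height
    divisor = patch * patch * compression
    if area % divisor != 0:
        raise ValueError("view area must be divisible by patch^2 * compression")
    return area // divisor


# Fixed query budgets for the two predefined resolutions named in the text.
GLOBAL_TOKENS = tokens_for_view(1024, 1024)
LOCAL_TOKENS = tokens_for_view(768, 768)


def total_visual_tokens(num_local_crops: int) -> int:
    """Total tokens fed to the LLM: k * 144 + 256 for k local crops."""
    if not 0 <= num_local_crops <= 6:
        raise ValueError("the text limits the number of local crops to 0..6")
    return num_local_crops * LOCAL_TOKENS + GLOBAL_TOKENS


if __name__ == "__main__":
    for k in range(7):
        print(k, total_visual_tokens(k))
```

With no local crops only the 256 global tokens remain, and six crops give the 1120-token upper bound.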
[原文]The human visual system closely mirrors transformer-based vision encoders [ dosovitskiy2020image , dehghani2023patch ] : foveal fixations function as visual tokens, locally sharp yet globally aware. However, unlike existing encoders that rigidly scan tokens from top-left to bottom-right, human vision follows a causally-driven flow guided by semantic understanding. Consider tracing a spiral—our eye movements follow inherent logic where each subsequent fixation causally depends on previous ones. By analogy, visual tokens in models should be selectively processed with ordering highly contingent o...
[原文]repended as a prefix—through a customized attention mask, visual tokens maintain global receptive fields, while causal flow tokens can obtain visual token reordering ability; (3) we maintain equal cardinality between causal and visual tokens (with redundancy such as padding and borders) to provide sufficient capacity for re-fixation; (4) only the causal flow tokens—the latter half of the encoder outputs—are fed to the LLM [ deepseekv2 ] decoder, enabling cascade causal-aware visual understanding. Second, leveraging DeepEncoder V2, we present DeepSeek-OCR 2, which preserves the image compressio...
[原文]onsiderable advances in visual reading logic. Figure 2 : This figure shows two computer vision models with parallelized queries: DETR’s decoder [ carion2020end ] for object detection and BLIP2’s Q-former [ li2023blip ] for visual token compression. Both employ bidirectional self-attention among queries.
[原文]also for token compression in multimodal alignment. 2.3 LLM-based Multimodal Initialization LLMs trained on large-scale internet data have proven effective as initialization for multimodal models. Pang et al. [ pang2023fozen ] demonstrated that frozen LLM transformer layers enhance visual discriminative tasks. Moreover, encoder-free or lightweight-encoder models such as Fuyu [ fuyu8b_model ] and Chameleon [ chameleon2024 ] in vision, as well as VALL-E [ wang2023neural ] in speech, further validate the potential of LLM pretrained weights for multimodal initialization. Figure 3 : DeepSeek-OCR 2 ...
[原文]DETR [ carion2020end ] pioneered the integration of transformer architecture into object detection, fundamentally breaking away from traditional detection paradigms [ ren2015faster , redmon2017yolo9000 ] . To overcome the efficiency limitations of serial decoding in transformer blocks, DETR introduced preset parallelized learnable queries—a set of 100 object queries that encode object priors such as shape and position through training. These queries interact with feature maps [ he2016deep ] via cross-attention mechanisms, while simultaneously engaging in bidirectional information exchange amon...
[原文]In recent years, vision-language models [ li2023blip , Qwen-VL , Qwen2.5-VL , wei2024vary ] have developed rapidly, with architectures converging toward the encoder-projector-LLM paradigm. The projector aligns visual tokens with the LLM’s embedding space, serving as a critical bridge that enables LLMs to understand visual content. Q-former, introduced in BLIP-2 [ li2023blip ] , exemplifies an effective projector design that employs learnable queries for visual token compression. Adopting a BERT-like [ devlin2019bert ] architecture and drawing inspiration from DETR’s object queries [ carion2020...
[原文]LLMs trained on large-scale internet data have proven effective as initialization for multimodal models. Pang et al. [ pang2023fozen ] demonstrated that frozen LLM transformer layers enhance visual discriminative tasks. Moreover, encoder-free or lightweight-encoder models such as Fuyu [ fuyu8b_model ] and Chameleon [ chameleon2024 ] in vision, as well as VALL-E [ wang2023neural ] in speech, further validate the potential of LLM pretrained weights for multimodal initialization. Figure 3 : DeepSeek-OCR 2 adopts the visual token compression mechanism from DeepEncoder, employing an 80M-parameter i...
[原文]i2024small , huang2026step3 ] through window attention with minimal parameters, significantly reducing both computational cost and activation memory for the subsequent global attention module. Moreover, its parameter count (80M) remains comparable to the typical 100M parameters used for text input embeddings in LLMs. 3.2.2 Language model as vision encoder In DeepEncoder, a CLIP ViT follows the vision tokenizer to compress visual knowledge. DeepEncoder V2 redesigns this component into an LLM-style architecture with a dual-stream attention mechanism. Visual tokens utilize bidirectional attentio...
[原文]ncoders that impose rigid spatial ordering through positional encodings, our causally-ordered queries adapt to smooth visual semantics while naturally aligning with the LLM’s unidirectional attention pattern. This design may bridge the gap between 2D spatial structure and 1D causal language modeling. Figure 5: Attention mask architecture of DeepEncoder V2. The mask concatenates a bidirectional mask (vision tokens, ViT-like) with a causal triangular mask (flow tokens, LLM decoder-style). 3.2.3 Causal flow query As mentioned above, the number of causal query tokens equals the number of visual tokens, comp...
[原文]sal attention (triangular mask, identical to decoder-only LLMs) for causal flow tokens, where each token attends only to previous tokens. These two components are concatenated along the sequence dimension to construct DeepEncoder V2’s attention mask $M$, as follows:

$$M=\begin{bmatrix}\mathbf{1}_{m\times m}&\mathbf{0}_{m\times n}\\ \mathbf{1}_{n\times m}&\text{LowerTri}(n)\end{bmatrix},\quad\text{where }n=m \tag{1}$$

where $n$ is the number of causal query tokens, $m$ is the number of vanilla visual tokens, and LowerTri denotes a lower ...
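As a minimal sketch of the block structure in Equation (1), the following pure-Python function builds the mask as a 0/1 matrix. It assumes that entry `[i][j] == 1` means token `i` may attend to token `j`, and that the lower-triangular block includes the diagonal, as in standard decoder-only LLMs. The function name is illustrative.

```python
from typing import List


def build_attention_mask(m: int, n: int) -> List[List[int]]:
    """Build the (m + n) x (m + n) mask of Equation (1).

    Rows and columns 0..m-1 are visual tokens; m..m+n-1 are causal flow queries.
    """
    size = m + n
    mask = []
    for i in range(size):
        row = []
        for j in range(size):
            if j < m:
                # Left blocks: every token (visual or query) sees all visual tokens.
                row.append(1)
            elif i < m:
                # Top-right block: visual tokens never attend to queries.
                row.append(0)
            else:
                # Bottom-right block: causal (lower-triangular) among queries.
                row.append(1 if j <= i else 0)
        mask.append(row)
    return mask


if __name__ == "__main__":
    for row in build_attention_mask(2, 2):
        print(row)
```

In practice the same pattern would typically be built as a boolean tensor and passed to the attention implementation; the nested-list form is used here only to keep the sketch dependency-free.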
[原文]As shown in Figure 3 , DeepSeek-OCR 2 inherits the overall architecture of DeepSeek-OCR, which consists of an encoder and a decoder. The encoder discretizes images into visual tokens, while the decoder generates outputs conditioned on these visual tokens and text prompts. The key distinction lies in the encoder: we upgrade DeepEncoder to DeepEncoder V2, which retains all capabilities of its predecessor while introducing causal reasoning through a novel architectural design. We elaborate on the details of DeepSeek-OCR 2 in the following sections.
[原文]The vanilla encoder serves as an important component that extracts and compresses image features through attention mechanisms, where each token attends to all others, achieving full-image receptive fields analogous to human foveal and peripheral vision. However, flattening 2D image patches into a 1D sequence imposes a rigid ordering bias through text-oriented positional encodings (e.g., RoPE [ su2021roformer ] ). This contradicts natural visual reading patterns, especially non-linear layouts in optical texts, forms and tables. Figure 4 : Token count calculation in DeepEncoder V2. DeepEncoder V...
[原文]nal attention to preserve CLIP’s global modeling capability, while newly introduced causal flow queries employ causal attention. These learnable queries are appended after visual tokens as a suffix, where each query attends to all visual tokens and preceding queries. By maintaining equal cardinality between queries and visual tokens, this design imposes semantic ordering and distillation on visual features without altering the token count. Finally, only the causal query outputs are fed to the LLM decoder. We instantiate this architecture using Qwen2-0.5B [ wang2024qwen2 ] , whose 500M parameters are...
[原文]tokens, computed as $\frac{W\times H}{16^{2}\times 16}$, where $W$ and $H$ denote the width and height of the image input to the encoder. To avoid maintaining multiple query sets for different resolutions, we adopt a multi-crop strategy with fixed query configurations at predefined resolutions. Specifically, the global view uses a resolution of $1024\times 1024$, corresponding to 256 query embeddings denoted as $\text{query}_{\text{global}}$. Local crops adopt a resolution of $768\times 768$, with the number of crops $k$ ranging from 0 to 6 (no cr...
[原文]Since DeepSeek-OCR 2 primarily focuses on encoder improvements, we do not upgrade the decoder component. Following this design principle, we retain DeepSeek-OCR’s decoder, a 3B-parameter MoE structure with about 500M active parameters. The core forward pass of DeepSeek-OCR 2 can be formulated as:

$$\mathbf{O}=\mathcal{D}\left(\pi_{Q}\left(\mathcal{T}^{L}\left(\mathcal{E}(\mathbf{I})\oplus\mathbf{Q}_{0};\mathbf{M}\right)\right)\right) \tag{2}$$

where $\mathbf{I}\in\mathbb{R}^{H\times W\times 3}$ is the input image, $\mathcal{E}$ is th...
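The data flow of Equation (2) can be traced with a schematic sketch. It rests on two assumptions: the symbols are read in line with the surrounding description (tokenize the image, append the learnable queries, run the masked encoder layers, keep only the query outputs, decode), and $\pi_{Q}$ is taken to select the query positions because the text states that only the causal query outputs reach the decoder. All callables are hypothetical stand-ins, not the released implementation.

```python
from typing import Any, Callable, List, Sequence


def forward_pass(
    image: Any,
    tokenizer: Callable[[Any], Sequence[Any]],
    queries: Sequence[Any],
    encoder_layers: Sequence[Callable[[List[Any], Any], List[Any]]],
    mask: Any,
    decoder: Callable[[List[Any]], Any],
) -> Any:
    """Schematic data flow of Equation (2); all components are placeholders."""
    visual_tokens = list(tokenizer(image))        # E(I): m visual tokens
    sequence = visual_tokens + list(queries)      # E(I) (+) Q_0: visual prefix, query suffix
    for layer in encoder_layers:                  # T^L: L layers sharing the mask M
        sequence = layer(sequence, mask)
    query_outputs = sequence[len(visual_tokens):]  # pi_Q: keep only the query positions
    return decoder(query_outputs)                 # D: decoder consumes the reordered tokens


# Toy usage with string "tokens" so the selection step is easy to see.
toy_output = forward_pass(
    image="page.png",
    tokenizer=lambda img: ["v0", "v1", "v2"],
    queries=["q0", "q1", "q2"],
    encoder_layers=[lambda seq, m: [t.upper() for t in seq]] * 2,
    mask=None,
    decoder=lambda toks: " ".join(toks),
)
print(toy_output)
```

The toy run shows that the visual prefix influences the computation but is dropped before decoding, so the decoder sees as many tokens as there are queries.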
[原文]about 100M image-text pair samples). 4.2.2 Query enhancement After DeepEncoder V2 pretraining, we integrate it with DeepSeek-3B-A500M [ deepseekv2 , deepseekv3 ] as our final pipeline. We freeze the visual tokenizer (SAM-conv structure) while jointly optimizing the LLM encoder and LLM decoder to enhance query representations. At this stage, we unify the two resolutions into a single dataloader via a multi-crop strategy. We adopt 4-stage pipeline parallelism: vision tokenizer (PP0), LLM-style encoder (PP1), and DeepSeek-LLM layers (6 layers per stage on PP2-3). With 160 GPUs (40GB per GPU), we co...
[原文]DeepSeek-OCR 2 employs the same data sources as DeepSeek-OCR, comprising OCR 1.0, OCR 2.0 [ chen2024onechart , wei2024slow , liu2024focus_fox ] , and general vision data [ wei2025deepseek ] , with OCR data constituting 80% of the training mixture. We also introduce two modifications: (1) a more balanced sampling strategy for OCR 1.0 data, partitioning pages by content type (text, formulas, tables) with a 3:1:1 ratio, and (2) label refinement for layout detection by merging semantically similar categories (e.g., unifying "figure caption" and "figure title"). Given these minimal differences, we ...
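One plausible reading of the 3:1:1 balancing is weighted sampling over content-type partitions. The sketch below illustrates that reading only; the actual data pipeline is not described in the text, and the dictionary keys and function name are assumptions.

```python
import random
from collections import Counter
from typing import List

# Relative sampling weights for the three OCR 1.0 partitions (3:1:1 in the text).
CONTENT_WEIGHTS = {"text": 3, "formulas": 1, "tables": 1}


def sample_content_types(num_samples: int, seed: int = 0) -> List[str]:
    """Draw content types so that text pages are sampled 3x as often as the others."""
    rng = random.Random(seed)
    kinds = list(CONTENT_WEIGHTS)
    weights = [CONTENT_WEIGHTS[kind] for kind in kinds]
    return rng.choices(kinds, weights=weights, k=num_samples)


if __name__ == "__main__":
    counts = Counter(sample_content_types(10_000))
    print(counts)  # roughly 60% text, 20% formulas, 20% tables
```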
[原文]We train DeepSeek-OCR 2 in three stages: (1) encoder pretraining, (2) query enhancement, and (3) decoder specialization. Stage 1 enables the vision tokenizer and LLM-style encoder to acquire fundamental capabilities in feature extraction, token compression, and token reordering. Stage 2 further strengthens the token reordering capability of the encoder while enhancing visual knowledge compression. Stage 3 freezes the encoder parameters and optimizes only the decoder, enabling higher data throughput under the same FLOPs. 4.2.1 Training DeepEncoder V2 Following DeepSeek-...
[原文]using the same optimizer and learning rate decay from 5e-5 to 1e-6 over 15k iterations. 4.2.3 Continue-training LLM To rapidly consume training data, we freeze all DeepEncoder V2 parameters in this stage and only update the DeepSeek-LLM parameters. This stage accelerates training (more than doubling the training speed under the same global batch size) while helping the LLM better understand DeepEncoder V2’s reordered visual tokens. Continuing from stage 2, we perform another learning rate decay from 1e-6 to 5e-8 over 20k iterations in this stage. Table 1: Comprehensive evaluation of do...
[原文]We select OmniDocBench v1.5 [ ouyang2025omnidocbench ] as our primary benchmark for evaluation. This benchmark comprises 1,355 document pages spanning 9 major categories (including magazines, academic papers, research reports, and so on) in both Chinese and English. With its diverse test samples and robust evaluation criteria, OmniDocBench provides an effective framework for validating the performance of DeepSeek-OCR 2, particularly the effectiveness of DeepEncoder V2. 5.1 Main Results As shown in Table 1, DeepSeek-OCR 2 achieves strong overall performance of 91.09% while using the smallest upper l...
[原文]d DeepSeek-OCR 2 across 9 document types and found that DeepSeek-OCR 2 still has considerable room for improvement, as shown in Table 3. For text recognition Edit Distance (ED), DeepSeek-OCR 2 outperforms DeepSeek-OCR in most cases, but there are also notable weaknesses, such as newspapers, where its ED exceeds 0.13. We believe there are two main reasons: (1) the lower cap on visual tokens may affect the recognition of text-dense newspapers, which can be addressed in the future simply by increasing the number of local crops; (2) insufficient newspaper data: our trainin...
[原文]ervice that reads images and documents for DeepSeek-LLMs, and a pretraining data pipeline that performs batch PDF processing. We compare the production performance of DeepSeek-OCR 2 and DeepSeek-OCR. Since ground truth is unavailable in production environments, we focus primarily on repetition rate as our key metric. As shown in Table 4, DeepSeek-OCR 2 demonstrates markedly improved practical readiness compared to its predecessor (DeepSeek-OCR), reducing the repetition rate from 6.25% to 4.17% for online user-log images, and from 3.69% to 2.88% for PDF data production. These results further v...
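To put the two repetition-rate drops on a common scale, the short calculation below converts the absolute figures quoted above into relative reductions. It is plain arithmetic on the reported numbers; the helper name is illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a rate dropped, relative to its starting value."""
    if before <= 0:
        raise ValueError("the starting rate must be positive")
    return (before - after) / before


# Repetition rates (in percent) reported for the two production settings.
online_reduction = relative_reduction(6.25, 4.17)  # online user-log images
pdf_reduction = relative_reduction(3.69, 2.88)     # PDF data production

if __name__ == "__main__":
    print(f"online: {online_reduction:.1%}, pdf: {pdf_reduction:.1%}")
```

In relative terms, the repetition rate falls by about a third for online user-log images and by a little over a fifth for PDF data production.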
[原文]As shown in Table 1, DeepSeek-OCR 2 achieves strong overall performance of 91.09% while using the smallest upper limit of visual tokens (V-token max). Compared to the DeepSeek-OCR baseline, it demonstrates a 3.73% improvement under similar training data sources, validating the effectiveness of our newly designed architecture. Beyond the overall improvement, the Edit Distance (ED) for reading order (R-order) has also significantly decreased (from 0.085 to 0.057), indicating that the new DeepEncoder V2 can effectively select and arrange initial visual tokens based on image information. As illustrated i...
[原文]We conducted a detailed performance comparison between DeepSeek-OCR and DeepSeek-OCR 2 across 9 document types and found that DeepSeek-OCR 2 still has considerable room for improvement, as shown in Table 3. For text recognition Edit Distance (ED), DeepSeek-OCR 2 outperforms DeepSeek-OCR in most cases, but there are also notable weaknesses, such as newspapers, where its ED exceeds 0.13. We believe there are two main reasons: (1) the lower cap on visual tokens may affect the recognition of text-dense newspapers, which can be addressed in the future simply by increasing the nu...
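For readers unfamiliar with the metric, the sketch below shows one common way to compute a normalized edit distance. It assumes ED is the Levenshtein distance divided by the length of the longer string; the benchmark's exact normalization and text preprocessing are not described in this passage, so this is an illustration rather than the official scorer.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0 means an exact match."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(prediction, reference) / longest


if __name__ == "__main__":
    print(normalized_edit_distance("DeepSeek-0CR", "DeepSeek-OCR"))
```

Under this definition, an ED above 0.13 means that more than 13% of the characters, relative to the longer string, would need to be edited to match the reference.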
[原文]DeepSeek-OCR serves two primary production use cases: an online OCR service that reads images and documents for DeepSeek-LLMs, and a pretraining data pipeline that performs batch PDF processing. We compare the production performance of DeepSeek-OCR 2 and DeepSeek-OCR. Since ground truth is unavailable in production environments, we focus primarily on repetition rate as our key metric. As shown in Table 4, DeepSeek-OCR 2 demonstrates markedly improved practical readiness compared to its predecessor (DeepSeek-OCR), reducing the repetition rate from 6.25% to 4.17% for online user-log images, and...
[原文]DeepSeek-OCR 2 presents a novel architectural paradigm with an LLM-style encoder cascaded with an LLM decoder. This cascade of two 1D causal reasoners holds promise for genuine 2D reasoning: the encoder performs reading logic reasoning (causally reordering visual information through query tokens), while the decoder executes visual task reasoning over these causally-ordered representations. Decomposing 2D understanding into two complementary/orthogonal 1D causal reasoning subtasks may represent a breakthrough toward genuine 2D reasoning. Of course, achieving this goal remains a long journey. Fo...
[原文]DeepEncoder V2 provides initial validation of the LLM-style encoder’s viability for visual tasks. More importantly, this architecture has the potential to evolve into a unified omni-modal encoder: a single encoder with shared $W_k$, $W_v$ projections, attention mechanisms, and FFNs can process multiple modalities through modality-specific learnable query embeddings. Such an encoder could compress text, extract speech features, and reorganize visual content within the same parameter space, differing only in the learned weights of their query embeddings. DeepSeek-OCR’s optical compress...
[原文]In this technical report, we present DeepSeek-OCR 2, a significant upgrade to DeepSeek-OCR that maintains high visual token compression while achieving meaningful performance improvements. This advancement is powered by the newly proposed DeepEncoder V2, which implicitly distills causal understanding of the visual world through the integration of both bidirectional and causal attention mechanisms, leading to causal reasoning capabilities in the vision encoder and, consequently, marked gains in visual reading logic. While optical text reading, particularly document parsing, represents one of...