DeepSeek-OCR: Contexts Optical Compression


arXiv: 2510.18234 | 2025-10-22


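Summary

DeepSeek-OCR uses optical context compression: it maps documents into 2D images to compress them efficiently and recognize their text. The model substantially compresses document image information while preserving semantic integrity, enabling very fast document analysis and text extraction. It shows strong recognition accuracy and speed on scanned documents, contracts, papers, and similar material.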

DeepSeek-OCR: Contexts Optical Compression

Haoran Wei, Yaofeng Sun, Yukun Li, DeepSeek-AI. Abstract: We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while compressing vision tokens efficiently. Our approach demonstrates the feasibility of optical compression through vision tokens and offers a new direction for future multimodal model design.

Original: Haoran Wei Yaofeng Sun Yukun Li DeepSeek-AI Abstract We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compr...

DeepSeek-OCR: Contexts Optical Compression

textual information. A single image containing document text can represent rich information using substantially fewer tokens than the equivalent digital text, suggesting that optical compression through vision tokens could achieve much higher compression ratios. This insight motivates us to reexamine vision-language models (VLMs) from an LLM-centric perspective, focusing on how vision encoders can improve LLMs' efficiency in processing textual information.

Original: textual information. A single image containing document text can represent rich information using substantially fewer tokens than the equivalent digital text, suggesting that optical compression through vision tokens could achieve much higher compression ratios. This insight motivates us to reexamine vision-language models (VLMs) from an LLM-centric perspective, focusing on how vision encoders can enhance LLMs’ efficiency in processing textual information rather than basic VQA [12, 24, 32, 41, 16] what humans excel at. OCR tasks, as an intermediate modality bridging vision and language, ...

DeepSeek-OCR: Contexts Optical Compression

the window attention component processes a large number of vision tokens, while the compressor reduces the number of vision tokens before they enter the dense global attention component, achieving effective memory and token compression. Third, we develop DeepSeek-OCR based on DeepEncoder and DeepSeek3B-MoE. As the figure shows, it achieves state-of-the-art performance among end-to-end models on OmniDocBench while using the fewest vision tokens.

Original: the window attention component processes a large number of vision tokens, while the compressor reduces vision tokens before they enter the dense global attention component, achieving effective memory and token compression. Third, we develop DeepSeek-OCR based on DeepEncoder and DeepSeek3B-MoE [19, 20]. As shown in the figure (fig:omni), it achieves state-of-the-art performance within end-to-end models on OmniDocBench while using the fewest vision tokens. Additionally, we equip the model with capabilities for parsing charts, chemical formulas, simple geometric figures, and natural images t...

DeepSeek-OCR: Contexts Optical Compression

The first type is a dual-tower architecture represented by Vary, which uses a parallel SAM encoder to increase visual vocabulary parameters for high-resolution image processing. While it offers controllable parameters and activation memory, this approach has significant drawbacks: it requires dual image preprocessing, which complicates deployment and makes encoder pipeline parallelism difficult during training.

Original: first type is a dual-tower architecture represented by Vary [36], which utilizes parallel SAM [17] encoder to increase visual vocabulary parameters for high-resolution image processing. While offering controllable parameters and activation memory, this approach suffers from significant drawbacks: it requires dual image preprocessing that complicates deployment and makes encoder pipeline parallelism challenging during training. The second type is tile-based method exemplified by InternVL2.0 [8], which processes images by dividing them into small tiles for parallel computation, reducing ...

DeepSeek-OCR: Contexts Optical Compression

tasks. GOT-OCR2.0 expands the scope of OCR2.0 to include more synthetic image parsing tasks and designs an OCR model with performance-efficiency trade-offs, further highlighting the potential of end-to-end OCR research. In addition, general vision models such as the Qwen-VL series, the InternVL series, and many of their derivatives continuously enhance their document OCR capabilities.

Original: sks. GOT-OCR2.0 [38] expands the scope of OCR2.0 to include more synthetic image parsing tasks and designs an OCR model with performance-efficiency trade-offs, further highlighting the potential of end-to-end OCR researches. Additionally, general vision models such as Qwen-VL series [35], InternVL series [8], and many their derivatives continuously enhance their document OCR capabilities to explore dense visual perception boundaries. However, a crucial research question that current models have not addressed is: for a document containing 1000 words, how many vision tokens are at least ...

DeepSeek-OCR: Contexts Optical Compression

activation at high resolutions; 3. Few vision tokens; 4. Support for multiple resolution inputs; 5. Moderate parameter count. However, as described in Section 2.1, current open-source encoders cannot fully satisfy all of these conditions. We therefore design a novel vision encoder ourselves, named DeepEncoder. Figure 4: Testing model performance under different compression ratios.

Original: tivation at high resolutions; 3. Few vision tokens; 4. Support for multiple resolution inputs; 5. Moderate parameter count. However, as described in the Section 2.1, current open-source encoders cannot fully satisfy all these conditions. Therefore, we design a novel vision encoder ourselves, named DeepEncoder. Figure 4: To test model performance under different compression ratios (requiring different numbers of vision tokens) and enhance the practicality of DeepSeek-OCR, we configure it with multiple resolution modes. 3.2.1 Architecture of DeepEncoder DeepEncoder mainly consists of two compone...

DeepSeek-OCR: Contexts Optical Compression

Gundam-M
Resolution  512     640     1024     1280     640+1024        1024+1280
Tokens      64      100     256      400      n×100+256       n×256+400
Process     resize  resize  padding  padding  resize+padding  resize+padding

3.2.2 Multiple resolution support: Suppose we have an image containing 1000 optical characters and want to test how many vision tokens are needed to decode it. This requires the model to support a variable number of vision tokens.

Original: dam-M
Resolution  512     640     1024     1280     640+1024        1024+1280
Tokens      64      100     256      400      n×100+256       n×256+400
Process     resize  resize  padding  padding  resize+padding  resize+padding

3.2.2 Multiple resolution support Suppose we have an image with 1000 optical characters and we want to test how many vision tokens are needed for decoding. This requires the model to support a variable number of vision tokens. That is to say the DeepEncoder needs to support multiple resolutions. We meet the requirement aforementioned through dynamic interpolation of positional encodings, and design several reso...

1 Introduction

Current large language models (LLMs) face significant computational challenges when processing long textual content, because compute cost scales quadratically with sequence length. We explore a potential solution: using the visual modality as an efficient compression medium for textual information. A single image containing document text can represent rich information using substantially fewer tokens than the equivalent digital text.

Original: Current Large Language Models (LLMs) face significant computational challenges when processing long textual content due to quadratic scaling with sequence length. We explore a potential solution: leveraging visual modality as an efficient compression medium for textual information. A single image containing document text can represent rich information using substantially fewer tokens than the equivalent digital text, suggesting that optical compression through vision tokens could achieve much higher compression ratios. This insight motivates us to reexamine vision-language models (VLMs) from a...

1 Introduction

architecture that maintains low activation memory and minimal vision tokens even with high-resolution inputs. It serially connects window attention and global attention encoder components through a 16× convolutional compressor. This design ensures that the window attention component processes a large number of vision tokens, while the compressor reduces the number of vision tokens before they enter global attention.

Original: architecture that maintains low activation memory and minimal vision tokens even with high-resolution inputs. It serially connects window attention and global attention encoder components through a 16× convolutional compressor. This design ensures that the window attention component processes a large number of vision tokens, while the compressor reduces vision tokens before they enter the dense global attention component, achieving effective memory and token compression. Third, we develop DeepSeek-OCR based on DeepEncoder and DeepSeek3B-MoE [19, 20]. As shown in Figure LABEL:fig:om...

1 Introduction

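encoders commonly used in current open-source VLMs, all of which suffer from their respective deficiencies. These include high computational complexity, a large memory footprint, and large token counts, which limit efficiency in real deployments. DeepEncoder aims to address these limitations with an architecture designed for efficient optical compression.

Original: ncoders commonly used in current open-source VLMs, all of which suffer from their respective deficiencies.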

2 Related Works

2.1 Typical Vision Encoders in VLMs: Current open-source VLMs employ three main types of vision encoders, as illustrated in Figure 2. The first type is a dual-tower architecture represented by Vary, which uses a parallel SAM encoder to increase visual vocabulary parameters for high-resolution image processing. While it offers controllable parameters and activation memory, this approach has significant drawbacks.

Original: 2.1 Typical Vision Encoders in VLMs Current open-source VLMs employ three main types of vision encoders, as illustrated in Figure 2. The first type is a dual-tower architecture represented by Vary [36], which utilizes parallel SAM [17] encoder to increase visual vocabulary parameters for high-resolution image processing. While offering controllable parameters and activation memory, this approach suffers from significant drawbacks: it requires dual image preprocessing that complicates deployment and makes encoder pipeline parallelism challenging during training. The second type is tile-ba...

2 Related Works

first employs an end-to-end framework for OCR of academic papers on arXiv, demonstrating the potential of models in handling dense perception tasks. GOT-OCR2.0 expands the scope of OCR2.0 to include more synthetic image parsing tasks and designs an OCR model with performance-efficiency trade-offs, further highlighting the potential of end-to-end OCR research. In addition, general vision models continue to improve their document OCR capabilities.

Original: first employs end-to-end framework for academic paper OCR on arXiv, demonstrating the potential of models in handling dense perception tasks. GOT-OCR2.0 [38] expands the scope of OCR2.0 to include more synthetic image parsing tasks and designs an OCR model with performance-efficiency trade-offs, further highlighting the potential of end-to-end OCR researches. Additionally, general vision models such as Qwen-VL series [35], InternVL series [8], and many their derivatives continuously enhance their document OCR capabilities to explore dense visual perception boundaries. However, a crucia...

2.1 Typical Vision Encoders in VLMs

Current open-source VLMs employ three main types of vision encoders, as illustrated in Figure 2. The first type is a dual-tower architecture represented by Vary, which uses a parallel SAM encoder to increase visual vocabulary parameters for high-resolution image processing. While it offers controllable parameters and activation memory, this approach has significant drawbacks, starting with the need for dual image preprocessing.

Original: Current open-source VLMs employ three main types of vision encoders, as illustrated in Figure 2. The first type is a dual-tower architecture represented by Vary [36], which utilizes parallel SAM [17] encoder to increase visual vocabulary parameters for high-resolution image processing. While offering controllable parameters and activation memory, this approach suffers from significant drawbacks: it requires dual image preprocessing that complicates deployment and makes encoder pipeline parallelism challenging during training. The second type is tile-based method exemplified by InternVL2....

2.2 End-to-end OCR Models

OCR, particularly the document parsing task, has been a highly active topic in the image-to-text domain. With the advancement of VLMs, a large number of end-to-end OCR models have emerged, fundamentally transforming the traditional pipeline architecture (which required separate detection and recognition expert models) and simplifying OCR systems. Nougat first employs an end-to-end framework for academic paper OCR.

Original: OCR, particularly document parsing task, has been a highly active topic in the image-to-text domain. With the advancement of VLMs, a large number of end-to-end OCR models have emerged, fundamentally transforming the traditional pipeline architecture (which required separate detection and recognition expert models) by simplifying OCR systems. Nougat [6] first employs end-to-end framework for academic paper OCR on arXiv, demonstrating the potential of models in handling dense perception tasks. GOT-OCR2.0 [38] expands the scope of OCR2.0 to include more synthetic image parsing tasks and desig...

3 Methodology

3.1 Architecture: As shown in Figure 3, DeepSeek-OCR uses a unified end-to-end VLM architecture consisting of an encoder and a decoder. The encoder (namely DeepEncoder) is responsible for extracting image features and for tokenizing and compressing visual representations. The decoder generates the required result based on image tokens and prompts. DeepEncoder has approximately 380M parameters.

Original: 3.1 Architecture As shown in Figure 3, DeepSeek-OCR enjoys a unified end-to-end VLM architecture consisting of an encoder and a decoder. The encoder (namely DeepEncoder) is responsible for extracting image features and tokenizing as well as compressing visual representations. The decoder is used for generating the required result based on image tokens and prompts. DeepEncoder is approximately 380M in parameters, mainly composed of an 80M SAM-base [17] and a 300M CLIP-large [29] connected in series. The decoder adopts a 3B MoE [19, 20] architecture with 570M activated parameters. In the...

3 Methodology


Original: 36] and use a 2-layer convolutional module to perform 16× downsampling of vision tokens. Each convolutional layer has a kernel size of 3, stride of 2, padding of 1, and channels increase from 256 to 1024. Assuming we input a 1024×1024 image, the DeepEncoder will segment it into 1024/16 × 1024/16 = 4096 patch tokens. Since the first half of encoder is dominated by window attention and only 80M, the activation is acceptable. Before entering global attention, the 4096 tokens go through the compression module and the token count becomes 4096/16 = 256, thus making the overall a...
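The token arithmetic in this passage can be reproduced with a minimal Python sketch. It assumes a 16-pixel patch size (implied by the 1024/16 term) and square inputs; the function names `conv_out` and `deepencoder_token_counts` are illustrative and are not taken from the DeepSeek-OCR code.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of one convolution along a single spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def deepencoder_token_counts(image_size: int, patch_size: int = 16) -> tuple[int, int]:
    """Return (patch tokens before compression, vision tokens after compression).

    The compressor is modeled as two stacked stride-2 convolutions
    (kernel 3, padding 1), which halve each side twice: 4x per side, 16x overall.
    """
    side = image_size // patch_size
    compressed_side = conv_out(conv_out(side))
    return side * side, compressed_side * compressed_side


print(deepencoder_token_counts(1024))  # (4096, 256)
```

Under the same assumptions, inputs of 512, 640, and 1280 pixels give 64, 100, and 400 vision tokens, which agrees with the Tiny, Small, and Large entries of the resolution table.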

3 Methodology

3.2 DeepEncoder: To explore the feasibility of contexts optical compression, we need a vision encoder with the following features: 1. Capable of processing high resolutions; 2. Low activation at high resolutions; 3. Few vision tokens; 4. Support for multiple resolution inputs; 5. Moderate parameter count. However, as described in Section 2.1, current open-source encoders cannot fully satisfy all of these conditions. We therefore design DeepEncoder.

Original: ) respectively. Since Tiny and Small modes have relatively small resolutions, to avoid wasting vision tokens, images are processed by directly resizing the original shape. For Base and Large modes, in order to preserve the original image aspect ratio, images are padded to the corresponding size. After padding, the number of valid vision tokens is less than the actual number of vision tokens, with the calculation formula being: $N_{valid}=\lceil N_{actual}\times[1-((\max(w,h)-\min(w,h))/(\max(w,h)))]\rceil$...
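The valid-token formula for padded modes (the ceiling of N_actual times one minus the relative difference between the longer and shorter side) can be checked with a short Python sketch. The function name `valid_vision_tokens` is hypothetical. Because 1 - (max - min)/max equals min/max, the sketch uses integer arithmetic to avoid floating-point rounding inside the ceiling.

```python
def valid_vision_tokens(n_actual: int, width: int, height: int) -> int:
    """Valid tokens after padding an image of size width x height to a square.

    Implements ceil(n_actual * (1 - (max - min) / max)), which simplifies to
    ceil(n_actual * min / max); the negated floor division is an exact ceiling.
    """
    long_side, short_side = max(width, height), min(width, height)
    return -(-n_actual * short_side // long_side)


# A square page wastes nothing; a 2:1 page leaves half of the tokens valid.
print(valid_vision_tokens(256, 1024, 1024))  # 256
print(valid_vision_tokens(400, 2000, 1000))  # 200
```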

3 Methodology

We design DeepEncoder with multiple native resolution and dynamic resolution modes.

            Native Resolution                 Dynamic Resolution
Mode        Tiny    Small   Base     Large    Gundam          Gundam-M
Resolution  512     640     1024     1280     640+1024        1024+1280
Tokens      64      100     256      400      n×100+256       n×256+400
Process     resize  resize  padding  padding  resize+padding  resize+padding

Original: Gundam-master’s resolution is too large and training it together would slow down the overall training speed. 3.3 The MoE Decoder Our decoder uses the DeepSeekMoE [19, 20], specifically DeepSeek-3B-MoE. During inference, the model activates 6 out of 64 routed experts and 2 shared experts, with about 570M activated parameters. The 3B DeepSeekMoE is very suitable for domain-centric (OCR for us) VLM research, as it obtains the expressive capability of a 3B model while enjoying the inference efficiency of a 500M small model. The decoder reconstructs the original text representation from the com...

3 Methodology

3.2.2 Multiple resolution support: Suppose we have an image containing 1000 optical characters and want to test how many vision tokens are needed to decode it. This requires the model to support a variable number of vision tokens. Dynamic resolution can be composed of two native resolutions. For example, Gundam mode consists of n tiles of 640×640 (local views) and a 1024×1024 global view.

Original: tions display. We format the ground truth into an interleaved layout and text format, where each paragraph of text is preceded by the coordinates and label of it in the original image. All coordinates are normalized into 1000 bins. 3.4.1 OCR 1.0 data Document data is the top priority for DeepSeek-OCR. We collect 30M pages of diverse PDF data covering about 100 languages from the Internet, with Chinese and English accounting for approximately 25M and other languages accounting for 5M. For this data, we create two types of ground truth: coarse annotations and fine annotations. Coarse annotations...
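To illustrate what normalizing coordinates into 1000 bins could look like, here is a minimal Python sketch. The text does not state the exact convention, so the choices below (bins indexed 0 to 999, floor rounding, clamping at the image edge) and the names `to_bin` and `normalize_box` are assumptions made for illustration.

```python
def to_bin(value: float, extent: float, bins: int = 1000) -> int:
    """Map a pixel coordinate in [0, extent] to an integer bin in [0, bins - 1]."""
    index = int(value / extent * bins)
    return min(bins - 1, max(0, index))


def normalize_box(box: tuple[float, float, float, float],
                  width: float, height: float, bins: int = 1000) -> tuple[int, int, int, int]:
    """Normalize an (x0, y0, x1, y1) pixel box so it is independent of image size."""
    x0, y0, x1, y1 = box
    return (to_bin(x0, width, bins), to_bin(y0, height, bins),
            to_bin(x1, width, bins), to_bin(y1, height, bins))


# The lower-right quadrant of a 1024x512 page.
print(normalize_box((512, 256, 1024, 512), width=1024, height=512))  # (500, 500, 999, 999)
```

A fixed number of bins keeps the coordinate vocabulary small and makes annotations comparable across pages of different pixel sizes.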

3 Methodology

3.3 The MoE Decoder: Our decoder uses DeepSeekMoE, specifically DeepSeek-3B-MoE. During inference, the model activates 6 of 64 routed experts plus 2 shared experts, for about 570M activated parameters. The 3B DeepSeekMoE is well suited to domain-centric (for us, OCR) VLM research, because it obtains the expressive capability of a 3B model while enjoying the inference efficiency of a 500M small model.

Original: nese and English. Like document OCR, natural scene OCR can also control whether to output detection boxes through prompts. 3.4.2 OCR 2.0 data Following GOT-OCR2.0 [38], we refer to chart, chemical formula, and plane geometry parsing data as OCR 2.0 data. For chart data, following OneChart [7], we use pyecharts and matplotlib to render 10M images, mainly including commonly used line, bar, pie, and composite charts. We define chart parsing as image-to-HTML-table conversion task, as shown in the figure (fig:demo2-1). For chemical formulas, we utilize SMILES format from PubChem as the data ...

3 Methodology

3.4 Data Engine: We construct complex and diverse training data for DeepSeek-OCR, including OCR 1.0 data, which mainly consists of traditional OCR tasks such as scene image OCR and document OCR; OCR 2.0 data, which mainly includes parsing tasks for complex artificial images, such as common charts, chemical formulas, and plane geometry; and general vision data, which is mainly used to inject a degree of general visual understanding.

Original: is not a general VLM model, and this portion of data accounts for only 20% of the total data. We introduce such type of data mainly to preserve the general vision interface, so that researchers interested in our model and general vision task can conveniently advance their work in the future. 3.4.4 Text-only data To ensure the model’s language capabilities, we introduced 10% of in-house text-only pretrain data, with all data processed to a length of 8192 tokens, which is also the sequence length for DeepSeek-OCR. In summary, when training DeepSeek-OCR, OCR data accounts for 70%, general vision ...
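The stated data mix (70% OCR, 20% general vision, 10% text-only) can be turned into per-source sample counts with a small Python sketch. The dictionary keys and the function name `samples_per_source` are hypothetical, and assigning the rounding remainder to the OCR share is an arbitrary choice made here so that the counts always sum to the requested total.

```python
# Percentages as stated in the text; the key names are illustrative.
MIX_PERCENT = {"ocr": 70, "general_vision": 20, "text_only": 10}


def samples_per_source(total: int) -> dict[str, int]:
    """Split a sample budget across data sources according to MIX_PERCENT."""
    counts = {name: total * pct // 100 for name, pct in MIX_PERCENT.items()}
    # Integer division can drop a few samples; hand them to the largest source.
    counts["ocr"] += total - sum(counts.values())
    return counts


print(samples_per_source(1000))  # {'ocr': 700, 'general_vision': 200, 'text_only': 100}
```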

3 Methodology

3.4.1 OCR 1.0 data: Document data is the top priority for DeepSeek-OCR. We collect 30M pages of diverse PDF data from the Internet covering about 100 languages, with Chinese and English accounting for approximately 25M pages and other languages for 5M. Like document OCR, natural scene OCR can also control whether to output detection boxes through prompts.

Original: , while treating the CLIP part as input embedding layer and place it in PP1 with unfrozen weights for training. For the language model part, since DeepSeek3B-MoE has 12 layers, we place 6 layers each on PP2 and PP3. We use 20 nodes (each with 8 A100-40G GPUs) for training, with a data parallelism (DP) of 40 and a global batch size of 640. We use the AdamW optimizer with a step-based scheduler and an initial learning rate of 3e-5. For text-only data, the training speed is 90B tokens/day, while for multimodal data, the training speed is 70B tokens/day. Table 2: We test DeepSeek-OCR’s vision-tex...
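The cluster figures in this passage are mutually consistent, which a few lines of Python can confirm. The sketch assumes that the GPU count factors only into pipeline stages times data-parallel replicas (no tensor parallelism); the function name `pipeline_layout` is hypothetical. The visible text names stages PP1 to PP3, so the fourth stage implied by the arithmetic is an inference rather than something stated here.

```python
def pipeline_layout(nodes: int = 20, gpus_per_node: int = 8,
                    data_parallel: int = 40, global_batch: int = 640) -> dict[str, int]:
    """Derive pipeline depth and per-replica batch size from the cluster shape."""
    total_gpus = nodes * gpus_per_node
    if total_gpus % data_parallel or global_batch % data_parallel:
        raise ValueError("data_parallel must divide both the GPU count and the batch size")
    return {
        "total_gpus": total_gpus,
        "pipeline_stages": total_gpus // data_parallel,
        "batch_per_replica": global_batch // data_parallel,
    }


print(pipeline_layout())
# {'total_gpus': 160, 'pipeline_stages': 4, 'batch_per_replica': 16}
```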

3 Methodology

3.4.2 OCR 2.0 data: Following GOT-OCR2.0, we refer to chart, chemical formula, and plane geometry parsing data as OCR 2.0 data. For chart data, following OneChart, we use pyecharts and matplotlib to render 10M images, mainly including commonly used line, bar, pie, and composite charts.

Original: all text formula table order
Pipeline Models
Dolphin [11] - 0.356 0.352 0.465 0.258 0.35 0.44 0.44 0.604 0.367 0.351
Marker [1] - 0.296 0.085 0.374 0.609 0.116 0.497 0.293 0.688 0.678 0.329
Mathpix [2] - 0.191 0.105 0.306 0.243 0.108 0.364 0.381 0.454 0.32 0.30
MinerU-2.1.1 [34] - 0.162 0.072 0.313 0.166 0.097 0.244 0.111 0.581 0.15 0.136
MonkeyOCR-1.2B [18] - 0.154 0.062 0.295 0.164 0.094 0.263 0.179 0.464 0.168 0.243
PPstructure-v3 [9] - 0.152 0.073 0.295 0.162 0.077 0.223 0.136 0.535 0.111 0.11
End-to-end Models
Nougat [6] 2352 0.452 0.365 0.488 0.572 0.382 0.973 0.998 0.941 1....

3 Methodology

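DeepSeek-OCR is not a general VLM, and this portion of the data accounts for only 20% of the total. We introduce such data mainly to preserve the general vision interface, so that researchers interested in our model and in general vision tasks can conveniently advance their work. 3.4.4 Text-only data: To ensure the model's language capabilities, we introduce 10% text-only pretraining data.

Original: 0.269 0.134 0.062 0.181 0.097 0.432 0.089 0.103
Gundam-M †200dpi 1853 0.123 0.049 0.242 0.147 0.056 0.157 0.087 0.377 0.08 0.085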

3.1 Architecture

3.5 Training Pipelines: We treat the CLIP part as an input embedding layer and place it in PP1 with unfrozen weights for training. For the language model, since DeepSeek3B-MoE has 12 layers, we place 6 layers each on PP2 and PP3. We train on 20 nodes (each with 8 A100-40G GPUs), with a data parallelism of 40 and a global batch size of 640.

Original: As shown in Figure 3, DeepSeek-OCR enjoys a unified end-to-end VLM architecture consisting of an encoder and a decoder. The encoder (namely DeepEncoder) is responsible for extracting image features and tokenizing as well as compressing visual representations. The decoder is used for generating the required result based on image tokens and prompts. DeepEncoder is approximately 380M in parameters, mainly composed of an 80M SAM-base [17] and a 300M CLIP-large [29] connected in series. The decoder adopts a 3B MoE [19, 20] architecture with 570M activated parameters. In the following paragr...

3.2 DeepEncoder

Performance comparison table: edit-distance results against Dolphin, Marker, Mathpix, MinerU-2.1.1, MonkeyOCR-1.2B, and other models. DeepSeek-OCR performs well on several metrics, particularly text recognition and layout understanding.

Original: To explore the feasibility of contexts optical compression, we need a vision encoder with the following features: 1. Capable of processing high resolutions; 2. Low activation at high resolutions; 3. Few vision tokens; 4. Support for multiple resolution inputs; 5. Moderate parameter count. However, as described in the Section 2.1, current open-source encoders cannot fully satisfy all these conditions. Therefore, we design a novel vision encoder ourselves, named DeepEncoder. Figure 4: To test model performance under different compression ratios (requiring different numbers of vision tokens) and en...

3.2 DeepEncoder


Original: poses, we design DeepEncoder with diverse native resolution and dynamic resolution modes.

            Native Resolution                 Dynamic Resolution
Mode        Tiny    Small   Base     Large    Gundam          Gundam-M
Resolution  512     640     1024     1280     640+1024        1024+1280
Tokens      64      100     256      400      n×100+256       n×256+400
Process     resize  resize  padding  padding  resize+padding  resize+padding

3.2.2 Multiple resolution support Suppose we have an image with 1000 optical characters and we want to test how many vision tokens are needed for decoding. This requires the model to support a variable number of vision tokens. That is to say the Deep...

3.2 DeepEncoder

Original: nput image. Dynamic resolution can be composed of two native resolutions. For example, Gundam mode consists of n×640×640 tiles (local views) and a 1024×1024 global view. The tiling method following InternVL2.0 [8]. Supporting dynamic resolution is mainly for application considerations, especially for ultra-high-resolution inputs (such as newspaper images). Tiling is a form of secondary window attention that can effectively reduce activation memory further. It’s worth noting that due to our relatively large native resolutions, images won’t be fragmented too much un...
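The way a dynamic mode combines two native resolutions can be sketched in Python as follows. The per-mode token counts come from the resolution table (Gundam uses n×100+256 tokens, Gundam-M uses n×256+400); the names `NATIVE_TOKENS`, `DYNAMIC_MODES`, and `vision_tokens` are illustrative, and how the tile count n is chosen for a given image is not covered here.

```python
# Vision tokens per native mode, as listed in the resolution table.
NATIVE_TOKENS = {"Tiny": 64, "Small": 100, "Base": 256, "Large": 400}

# Each dynamic mode pairs a local-tile mode with a global-view mode.
DYNAMIC_MODES = {"Gundam": ("Small", "Base"), "Gundam-M": ("Base", "Large")}


def vision_tokens(mode: str, n_tiles: int = 0) -> int:
    """Total vision tokens for a native mode, or for a dynamic mode with n_tiles tiles."""
    if mode in NATIVE_TOKENS:
        return NATIVE_TOKENS[mode]
    local_mode, global_mode = DYNAMIC_MODES[mode]
    if n_tiles < 1:
        raise ValueError("dynamic modes need at least one local tile")
    return n_tiles * NATIVE_TOKENS[local_mode] + NATIVE_TOKENS[global_mode]


print(vision_tokens("Gundam", n_tiles=4))    # 656
print(vision_tokens("Gundam-M", n_tiles=2))  # 912
```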

3.3 The MoE Decoder

Original: Our decoder uses the DeepSeekMoE [19, 20], specifically DeepSeek-3B-MoE. During inference, the model activates 6 out of 64 routed experts and 2 shared experts, with about 570M activated parameters. The 3B DeepSeekMoE is very suitable for domain-centric (OCR for us) VLM research, as it obtains the expressive capability of a 3B model while enjoying the inference efficiency of a 500M small model. The decoder reconstructs the original text representation from the compressed latent vision tokens of DeepEncoder as: $f_{\text{dec}}:\mathbb{R}^{n\times d_{\text{latent}}}\to\mathbb{R}^{N\times d_{\text{text}}};\ \hat{\mathbf{X}}=f_{\text{dec}}(\mathbf{Z})\ \text{where}\ n\le N$...
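To make "6 out of 64 routed experts and 2 shared experts" concrete, here is a generic top-k routing sketch in Python. It is not the DeepSeekMoE gating function, whose details are not given in this text: the softmax over the selected logits, the stand-in router scores, and the name `route` are all assumptions for illustration.

```python
import math

N_ROUTED, TOP_K, N_SHARED = 64, 6, 2  # expert counts stated in the text


def route(router_logits: list[float], top_k: int = TOP_K) -> list[tuple[int, float]]:
    """Pick the top_k routed experts and softmax-normalize their mixing weights."""
    order = sorted(range(len(router_logits)), key=router_logits.__getitem__, reverse=True)
    top = order[:top_k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


logits = [math.sin(i) for i in range(N_ROUTED)]  # stand-in router scores for one token
chosen = route(logits)
# Shared experts run for every token, so each token touches 6 + 2 = 8 experts.
active_experts = len(chosen) + N_SHARED
print(active_experts)  # 8
```

Only the selected experts run for a given token, which is why the activated parameter count (about 570M) is much smaller than the 3B total.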

3.4 Data Engine

We design DeepEncoder with multiple native resolution and dynamic resolution modes. The modes are Tiny, Small, Base, Large, Gundam, and Gundam-M, each with its own resolution and token-count configuration.

Original: We construct complex and diverse training data for DeepSeek-OCR, including OCR 1.0 data, which mainly consists of traditional OCR tasks such as scene image OCR and document OCR; OCR 2.0 data, which mainly includes parsing tasks for complex artificial images, such as common charts, chemical formulas, and plane geometry parsing data; General vision data, which is mainly used to inject certain general image understanding capabilities into DeepSeek-OCR and preserve the general vision interface. (a) OCR 1.0 fine annotations display. We format the ground truth into an interleaved layout and text fo...

3.4 Data Engine

Original: ation image-text pairs can be seen in Figure 6(a). We also collect 3M Word data, constructing high-quality image-text pairs without layout by directly extracting content. This data mainly brings benefits to formulas and HTML-formatted tables. Additionally, we select some open-source data [28, 37] as supplements. For natural scene OCR, our model mainly supports Chinese and English. The image data sources come from LAION [31] and Wukong [13], labeled using PaddleOCR [9], with 10M data samples each for Chinese and English. Like document OCR, natural scene OCR can also control whether ...

3.4 Data Engine

Original: such as line segments, endpoint coordinates, line segment types, etc., for better readability. Each line segment is encoded using the Slow Perception [39] manner. 3.4.3 General vision data DeepEncoder can benefit from CLIP’s pretraining gains and has sufficient parameters to incorporate general visual knowledge. Therefore, we also prepare some corresponding data for DeepSeek-OCR. Following DeepSeek-VL2 [40], we generate relevant data for tasks such as caption, detection, and grounding. Note that DeepSeek-OCR is not a general VLM model, and this portion of data accounts for only 20% of the...

3.5 Training Pipelines

Original: Our training pipeline is very simple and consists mainly of two stages: a) Training DeepEncoder independently; b) Training the DeepSeek-OCR. Note that the Gundam-master mode is obtained by continuing training on a pre-trained DeepSeek-OCR model with 6M sampled data. Since the training protocol is identical to other modes, we omit the detailed description hereafter. 3.5.1 Training DeepEncoder Following Vary [36], we utilize a compact language model [15] and use the next token prediction framework to train DeepEncoder. In this stage, we use all OCR 1.0 and 2.0 data aforementioned, as well a...

3.5 Training Pipelines

Original: 21] benchmarks. Text tokens represent the number of tokens after tokenizing the ground truth text using DeepSeek-OCR’s tokenizer. Vision Tokens=64 or 100 respectively represent the number of vision tokens output by DeepEncoder after resizing input images to 512×512 and 640×640.

Text Tokens   Vision Tokens=64          Vision Tokens=100         Pages
              Precision   Compression  Precision   Compression
600-700       96.5%       10.5×        98.5%       6.7×          7
700-800       93.8%       11.8×        97.3%       7.5×          28
800-900       83.8%       13.2×        96.8%       8.5×          28
900-1000      85.9%       15.1×        96.8%       9.7×          14
1000...

3.5 Training Pipelines

Original: 07 0.522
InternVL2-76B [8] 6790 0.44 0.353 0.543 0.547 0.317 0.443 0.29 0.701 0.555 0.228
Qwen2.5-VL-7B [5] 3949 0.316 0.151 0.376 0.598 0.138 0.399 0.243 0.5 0.627 0.226
OLMOCR [28] 3949 0.326 0.097 0.455 0.608 0.145 0.469 0.293 0.655 0.652 0.277
GOT-OCR2.0 [38] 256 0.287 0.189 0.360 0.459 0.141 0.411 0.315 0.528 0.52 0.28
OCRFlux-3B [3] 3949 0.238 0.112 0.447 0.269 0.126 0.349 0.256 0.716 0.162 0.263
GPT4o [26] - 0.233 0.144 0.425 0.234 0.128 0.399 0.409 0.606 0.329 0.251
InternVL3-78B [42] 6790 0.218 0.117 0.38 0.279 0.095 0.296 0.21 0.533 0.282 0.161
Qwen2.5-VL-72B [5] 3949...

4 Evaluation

Original: 4.1 Vision-text Compression Study We select Fox [21] benchmarks to verify DeepSeek-OCR’s compression-decompression capability for text-rich documents, in order to preliminarily explore the feasibility and boundaries of contexts optical compression. We use the English document portion of Fox, tokenize the ground truth text with DeepSeek-OCR’s tokenizer (vocabulary size of approximately 129k), and select documents with 600-1300 tokens for testing, which happens to be 100 pages. Since the number of text tokens is not large, we only need to test performance in Tiny and Small modes, where Tiny mo...
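The compression ratio studied here is the number of text tokens divided by the number of vision tokens. The Python sketch below computes it for the two modes used in the study (64 and 100 vision tokens) over the 600 to 1300 token range; the function names are hypothetical, and the 10× default threshold is taken from the abstract's remark about text tokens being within 10 times the vision tokens.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per vision token."""
    return text_tokens / vision_tokens


def within_ratio(text_tokens: int, vision_tokens: int, limit: float = 10.0) -> bool:
    """True when the page stays at or below the given compression ratio."""
    return compression_ratio(text_tokens, vision_tokens) <= limit


for vision in (64, 100):  # Tiny and Small modes
    for text in (600, 1000, 1300):
        ratio = compression_ratio(text, vision)
        print(f"{text} text tokens / {vision} vision tokens = {ratio:.1f}x, "
              f"within 10x: {within_ratio(text, vision)}")
```

The ratios reported in the paper's tables are computed from the actual token counts of each document, so they will not match bucket endpoints exactly.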

4 Evaluation

3.5 Training Pipelines: Our training pipeline is very simple and consists mainly of two stages: a) training DeepEncoder independently; b) training DeepSeek-OCR. Note that the Gundam-master mode is obtained by continuing training on a pre-trained DeepSeek-OCR model with 6M sampled data.

Original: encoder. Table 4: Edit distances for different categories of documents in OmniDocBench. The results show that some types of documents can achieve good performance with just 64 or 100 vision tokens, while others require Gundam mode.

Mode   Book   Slides  Financial Report  Textbook  Exam Paper  Magazine  Academic Papers  Notes  Newspaper  Overall
Tiny   0.147  0.116   0.207             0.173     0.294       0.201     0.395            0.297  0.94       0.32
Small  0.085  0.111   0.079             0.147     0.171       0.107     0.131            0.187  0.744      0.205
Base   0.037  0.08    0.027             0.1       0.13        0.073     0.052            0.176  0.645      0.156
Large  0.038  0.108   0.022             0.084     0.109       0.06      0.053            0.155  0.353      0.117
Gunda...
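Edit distance, the metric reported in this table, can be illustrated with a minimal Python sketch of character-level Levenshtein distance. Dividing by the length of the longer string is an assumption made here to obtain a score in [0, 1]; the benchmark's exact normalization is not described in this text, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0 means identical, lower is better."""
    longest = max(len(prediction), len(reference))
    return levenshtein(prediction, reference) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))  # 3
```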

4 Evaluation

Performance comparison table: benchmark results against multiple models. DeepSeek-OCR performs well on several metrics, particularly text recognition and layout understanding, and compares favorably with existing models.

Original: o does not exceed 10×. For newspapers, Gundam or even Gundam-master mode is required to achieve acceptable edit distances, because the text tokens in newspapers are 4-5,000, far exceeding the 10× compression of other modes. These experimental results further demonstrate the boundaries of contexts optical compression, which may provide effective references for researches on the vision token optimization in VLMs and context compression, forgetting mechanisms in LLMs. 4.3 Qualitative Study 4.3.1 Deep parsing DeepSeek-OCR possesses both layout and OCR 2.0 capabilities, enabling it...

4 Evaluation

4.1 Vision-text Compression Study: We select the Fox benchmark to verify DeepSeek-OCR's compression-decompression capability on text-rich documents, in order to preliminarily explore the feasibility and boundaries of contexts optical compression. We use the English document portion of Fox and tokenize the ground-truth text with DeepSeek-OCR's tokenizer (vocabulary size of approximately 129k).

Original: DF data on the Internet contains not only Chinese and English, but also a large amount of multilingual data, which is also crucial when training LLMs. For PDF documents, DeepSeek-OCR can handle nearly 100 languages. Like Chinese and English documents, multilingual data also supports both layout and non-layout OCR formats. The visualization results are shown in Figure 10, where we select Arabic and Sinhala languages to demonstrate results. Figure 10: To endow the capability of processing widely crawled PDFs (multilingual data), we train our model with OCR capabilities for nearly 100 languages...

4.1 Vision-text Compression Study

Table 4: Edit distances for different categories of documents in OmniDocBench. The results show that some types of documents achieve good performance with just 64 or 100 vision tokens, while others require Gundam mode. This flexibility lets DeepSeek-OCR pick a resolution configuration suited to each scenario, balancing performance and efficiency.

Original: We select Fox [21] benchmarks to verify DeepSeek-OCR’s compression-decompression capability for text-rich documents, in order to preliminarily explore the feasibility and boundaries of contexts optical compression. We use the English document portion of Fox, tokenize the ground truth text with DeepSeek-OCR’s tokenizer (vocabulary size of approximately 129k), and select documents with 600-1300 tokens for testing, which happens to be 100 pages. Since the number of text tokens is not large, we only need to test performance in Tiny and Small modes, where Tiny mode corresponds to 64 tokens and Sm...

4.1 Vision-text Compression Study

does not exceed 10×. For newspapers, Gundam or even Gundam-master mode is required to achieve acceptable edit distances, because newspapers contain 4,000 to 5,000 text tokens, far exceeding the 10× compression of the other modes. These experimental results further demonstrate the boundaries of contexts optical compression and may provide a useful reference for research on vision-text compression.

Original: for different categories of documents in OmniDocBench. The results show that some types of documents can achieve good performance with just 64 or 100 vision tokens, while others require Gundam mode.

Mode    Book   Slides  Financial Report  Textbook  Exam Paper  Magazine  Academic Papers  Notes  Newspaper  Overall
Tiny    0.147  0.116   0.207             0.173     0.294       0.201     0.395            0.297  0.94       0.32
Small   0.085  0.111   0.079             0.147     0.171       0.107     0.131            0.187  0.744      0.205
Base    0.037  0.08    0.027             0.1       0.13        0.073     0.052            0.176  0.645      0.156
Large   0.038  0.108   0.022             0.084     0.109       0.06      0.053            0.155  0.353      0.117
Gundam  0.035  0.085   0.289             0.095     0.094       0...

4.2 OCR Practical Performance

PDF data on the Internet contains not only Chinese and English but also a large amount of multilingual data, which is also important when training LLMs. For PDF documents, DeepSeek-OCR can handle nearly 100 languages. Like Chinese and English documents, multilingual data supports both layout and non-layout OCR formats. The visualization results are shown in Figure 10.

Original: DeepSeek-OCR is not only an experimental model; it has strong practical capabilities and can construct data for LLM/VLM pretraining. To quantify OCR performance, we test DeepSeek-OCR on OmniDocBench [27], with results shown in Table 3. Requiring only 100 vision tokens (640×640 resolution), DeepSeek-OCR surpasses GOT-OCR2.0 [38] which uses 256 tokens; with 400 tokens (285 valid tokens, 1280×1280 resolution), it achieves on-par performance with state-of-the-arts on this benchmark. Using fewer than 800 tokens (Gundam mode), DeepSeek-OCR outperforms MinerU2.0 [34] which n...

4.3 Qualitative Study


Original: 4.3.1 Deep parsing DeepSeek-OCR possesses both layout and OCR 2.0 capabilities, enabling it to further parse images within documents through secondary model calls, a feature we refer to as "deep parsing". As shown in Figures 6, 7, 8, 9, our model can perform deep parsing on charts, geometry, chemical formulas, and even natural images, requiring only a unified prompt. Figure 6: In the field of financial research reports, the deep parsing mode of DeepSeek-OCR can be used to obtain structured results of charts within documents. Charts are a crucial form of data representation in finance and ...

4.3 Qualitative Study

4.2 OCR Practical Performance: DeepSeek-OCR is not only an experimental model; it has strong practical capabilities and can construct data for LLM/VLM pretraining. To quantify OCR performance, we test DeepSeek-OCR on OmniDocBench. Requiring only 100 vision tokens (640×640 resolution), DeepSeek-OCR surpasses GOT-OCR2.0, which uses 256 tokens.

Original: PDFs (multilingual data), we train our model with OCR capabilities for nearly 100 languages. Minority language documents can also support both layout and non-layout outputs through different prompts. 4.3.3 General vision understanding We also provide DeepSeek-OCR with a certain degree of general image understanding capabilities. The related visualization results are shown in Figure 11. Figure 11: We retain DeepSeek-OCR’s capabilities in general visual understanding, mainly including image description, object detection, grounding, etc. Meanwhile, due to the inclusion of text-only data, DeepSe...

5 Discussion

4.3 Qualitative Study. 4.3.1 Deep parsing: DeepSeek-OCR possesses both layout and OCR 2.0 capabilities, enabling it to further parse images within documents through secondary model calls, a feature we refer to as deep parsing. As shown in Figures 6, 7, 8, and 9, our model can perform deep parsing on charts, geometry, chemical formulas, and even natural images.

Original: Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode N text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy. These findings suggest promising directions for future applications, such as implementing optical processing for dialogue histories beyond k rounds in multi-turn conversations to achieve 10× compression efficiency. For older contexts,...
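The idea of keeping the most recent k dialogue rounds as text and storing older rounds optically can be made concrete with a back-of-the-envelope Python sketch. Everything here is hypothetical: the function name `context_budget`, the per-round token counts, and the assumption that each older round shrinks by a flat 10× (the ratio at which the text reports near-lossless decoding).

```python
import math


def context_budget(round_tokens: list[int], k: int, ratio: float = 10.0) -> int:
    """Estimate context size when only the last k rounds stay as plain text.

    Older rounds are assumed to be rendered to images and to cost
    ceil(tokens / ratio) vision tokens each.
    """
    recent = round_tokens[-k:] if k > 0 else []
    older = round_tokens[:-k] if k > 0 else round_tokens
    return sum(recent) + sum(math.ceil(tokens / ratio) for tokens in older)


history = [1000, 1000, 1000, 1000]  # text tokens per dialogue round (made-up numbers)
print(context_budget(history, k=2))  # 2200 instead of 4000
```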

6 Conclusion

4.3.2 Multilingual recognition: For PDFs (multilingual data), we train our model with OCR capabilities for nearly 100 languages. Minority-language documents also support both layout and non-layout outputs through different prompts. This multilingual capability makes DeepSeek-OCR useful for document processing across a wide range of language settings.

Original: In this technical report, we propose DeepSeek-OCR and preliminarily validate the feasibility of contexts optical compression through this model, demonstrating that the model can effectively decode text tokens exceeding 10 times the quantity from a small number of vision tokens. We believe this finding will facilitate the development of VLMs and LLMs in the future. Additionally, DeepSeek-OCR is a highly practical model capable of large-scale pretraining data production, serving as an indispensable assistant for LLMs. Of course, OCR alone is insufficient to fully validate true context optical co...
