Haoran Wei, Yaofeng Sun, Yukun Li
DeepSeek-AI

Abstract

We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compr...
textual information. A single image containing document text can represent rich information using substantially fewer tokens than the equivalent digital text, suggesting that optical compression through vision tokens could achieve much higher compression ratios. This insight motivates us to reexamine vision-language models (VLMs) from an LLM-centric perspective, focusing on how vision encoders can enhance LLMs’ efficiency in processing textual information rather than on basic VQA [12, 24, 32, 41, 16], at which humans excel. OCR tasks, as an intermediate modality bridging vision and language, ...
[原文]the window attention component processes a large number of vision tokens, while the compressor reduces vision tokens before they enter the dense global attention component, achieving effective memory and token compression. Third, we develop DeepSeek-OCR based on DeepEncoder and DeepSeek3B-MoE [ 19 , 20 ] . As shown in Figure LABEL:fig:omni , it achieves state-of-the-art performance within end-to-end models on OmniDocBench while using the fewest vision tokens. Additionally, we equip the model with capabilities for parsing charts, chemical formulas, simple geometric figures, and natural images t...
first type is a dual-tower architecture represented by Vary [36], which utilizes a parallel SAM [17] encoder to increase visual vocabulary parameters for high-resolution image processing. While offering controllable parameters and activation memory, this approach suffers from significant drawbacks: it requires dual image preprocessing that complicates deployment and makes encoder pipeline parallelism challenging during training. The second type is the tile-based method exemplified by InternVL2.0 [8], which processes images by dividing them into small tiles for parallel computation, reducing ...
tasks. GOT-OCR2.0 [38] expands the scope of OCR2.0 to include more synthetic image parsing tasks and designs an OCR model with performance-efficiency trade-offs, further highlighting the potential of end-to-end OCR research. Additionally, general vision models such as the Qwen-VL series [35], the InternVL series [8], and many of their derivatives continuously enhance their document OCR capabilities to explore dense visual perception boundaries. However, a crucial research question that current models have not addressed is: for a document containing 1000 words, how many vision tokens are at least ...
[原文]tivation at high resolutions; 3.Few vision tokens; 4.Support for multiple resolution inputs; 5. Moderate parameter count. However, as described in the Section 2.1 , current open-source encoders cannot fully satisfy all these conditions. Therefore, we design a novel vision encoder ourselves, named DeepEncoder. Figure 4 : To test model performance under different compression ratios (requiring different numbers of vision tokens) and enhance the practicality of DeepSeek-OCR, we configure it with multiple resolution modes. 3.2.1 Architecture of DeepEncoder DeepEncoder mainly consists of two compone...
DeepSeek-OCR: Contexts Optical Compression
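... without layout: "\nFree OCR." to control the model’s output. Nevertheless, the output format still cannot fully match the Fox benchmark, so the actual performance may be somewhat higher than the test results. As shown in Table 2, within a 10× compression ratio, the model’s decoding precision reaches approximately 97%, which is a very promising result. In the future, it may be possible to achieve nearly 10× lossless contexts compression through text-to-image approaches. When the compression ratio exceeds 10×, performance begins to decline, which may have two causes: first, the layout of long documents becomes more complex; second, long texts become blurred at 512×512 or 640×640 resolution. The first issue can be solved by rendering texts onto a single layout page, while we believe the second issue will become a feature of the forgetting mechanism. When compressing tokens by nearly 20×, we find that precision can still approach 60%. These results indicate that optical contexts compression is a very promising research direction that deserves further study. The approach brings no additional overhead because it can reuse VLM infrastructure, as multimodal systems inherently require an additional vision encoder.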
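The compression ratio discussed above is the number of text tokens divided by the number of vision tokens used to represent them. The following is a minimal sketch of that arithmetic; the function names and the sample text-token counts are illustrative assumptions and are not taken from the DeepSeek-OCR code.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per vision token."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


def within_ratio(text_tokens: int, vision_tokens: int, max_ratio: float = 10.0) -> bool:
    """True when the ratio is at or below max_ratio (10x by default)."""
    return compression_ratio(text_tokens, vision_tokens) <= max_ratio


if __name__ == "__main__":
    # 64 and 100 are the vision-token counts of the 512x512 and 640x640 modes.
    # The text-token counts below are made-up sample values.
    for text_tokens in (600, 1000, 1200):
        for vision_tokens in (64, 100):
            ratio = compression_ratio(text_tokens, vision_tokens)
            ok = within_ratio(text_tokens, vision_tokens)
            print(f"{text_tokens} text / {vision_tokens} vision = {ratio:.1f}x, within 10x: {ok}")
```

The 10× default mirrors the regime reported as roughly 97% precision; the helper only classifies a ratio and does not predict accuracy.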
Table 3 | We use OmniDocBench [27] to test the performance of DeepSeek-OCR on real document parsing tasks. All metrics in the table are edit distances, where smaller values indicate better performance. "Tokens" represents the average number of vision tokens used per page, and "†200dpi" means using fitz to interpolate the original image to 200dpi. For the DeepSeek-OCR model, the values in parentheses in the "Tokens" column represent valid vision tokens, calculated according to Equation 1.

Columns: Model; Tokens; English (overall, text, formula, table, order); Chinese (overall, text, formula, table, order).

Pipeline Mode...
[原文]Current Large Language Models (LLMs) face significant computational challenges when processing long textual content due to quadratic scaling with sequence length. We explore a potential solution: leveraging visual modality as an efficient compression medium for textual information. A single image containing document text can represent rich information using substantially fewer tokens than the equivalent digital text, suggesting that optical compression through vision tokens could achieve much higher compression ratios. This insight motivates us to reexamine vision-language models (VLMs) from a...
architecture that maintains low activation memory and minimal vision tokens even with high-resolution inputs. It serially connects window attention and global attention encoder components through a 16× convolutional compressor. This design ensures that the window attention component processes a large number of vision tokens, while the compressor reduces vision tokens before they enter the dense global attention component, achieving effective memory and token compression. Third, we develop DeepSeek-OCR based on DeepEncoder and DeepSeek3B-MoE [19, 20]. As shown in Figure LABEL:fig:om...
first employs an end-to-end framework for academic paper OCR on arXiv, demonstrating the potential of models in handling dense perception tasks. GOT-OCR2.0 [38] expands the scope of OCR2.0 to include more synthetic image parsing tasks and designs an OCR model with performance-efficiency trade-offs, further highlighting the potential of end-to-end OCR research. Additionally, general vision models such as the Qwen-VL series [35], the InternVL series [8], and many of their derivatives continuously enhance their document OCR capabilities to explore dense visual perception boundaries. However, a crucia...
Current open-source VLMs employ three main types of vision encoders, as illustrated in Figure 2. The first type is a dual-tower architecture represented by Vary [36], which utilizes a parallel SAM [17] encoder to increase visual vocabulary parameters for high-resolution image processing. While offering controllable parameters and activation memory, this approach suffers from significant drawbacks: it requires dual image preprocessing that complicates deployment and makes encoder pipeline parallelism challenging during training. The second type is the tile-based method exemplified by InternVL2....
OCR, particularly the document parsing task, has been a highly active topic in the image-to-text domain. With the advancement of VLMs, a large number of end-to-end OCR models have emerged, fundamentally transforming the traditional pipeline architecture (which required separate detection and recognition expert models) by simplifying OCR systems. Nougat [6] first employs an end-to-end framework for academic paper OCR on arXiv, demonstrating the potential of models in handling dense perception tasks. GOT-OCR2.0 [38] expands the scope of OCR2.0 to include more synthetic image parsing tasks and desig...
36] and use a 2-layer convolutional module to perform 16× downsampling of vision tokens. Each convolutional layer has a kernel size of 3, a stride of 2, and a padding of 1, and the channels increase from 256 to 1024. Assuming we input a 1024×1024 image, DeepEncoder will segment it into 1024/16 × 1024/16 = 4096 patch tokens. Since the first half of the encoder is dominated by window attention and has only 80M parameters, the activation is acceptable. Before entering global attention, the 4096 tokens go through the compression module and the token count becomes 4096/16 = 256, thus making the overall a...
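The token arithmetic in this passage can be checked with a short Python sketch. It uses only the figures stated in the text (patch size 16, two convolutional layers with kernel 3, stride 2, padding 1) together with the standard convolution output-size formula; the function names are illustrative and square inputs are assumed.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of one convolution along a single spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def vision_tokens(resolution: int, patch: int = 16, conv_layers: int = 2) -> tuple[int, int]:
    """Return (patch tokens, tokens after the compressor) for a square input."""
    side = resolution // patch      # patches per side
    patch_tokens = side * side
    for _ in range(conv_layers):    # each stride-2 layer halves the side length
        side = conv_out(side)
    return patch_tokens, side * side


if __name__ == "__main__":
    for res in (512, 640, 1024, 1280):
        before, after = vision_tokens(res)
        print(f"{res}x{res}: {before} patch tokens -> {after} vision tokens")
```

Two stride-2 layers halve each side twice, so the token count drops by 4 × 4 = 16, which is where the 16× factor comes from.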
) respectively. Since Tiny and Small modes have relatively small resolutions, to avoid wasting vision tokens, images are processed by directly resizing the original shape. For Base and Large modes, in order to preserve the original image aspect ratio, images are padded to the corresponding size. After padding, the number of valid vision tokens is less than the actual number of vision tokens, with the calculation formula being: $N_{valid} = \lceil N_{actual} \times [1 - ((\max(w,h) - \min(w,h)) / \max(w,h))] \rceil$ ...
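The valid-token formula can be written as a small Python helper. This is a sketch under the assumption that w and h are the width and height of the original image before padding; the function name is illustrative.

```python
def valid_vision_tokens(n_actual: int, w: int, h: int) -> int:
    """N_valid = ceil(N_actual * (1 - (max(w, h) - min(w, h)) / max(w, h)))."""
    if n_actual < 0 or w <= 0 or h <= 0:
        raise ValueError("n_actual must be non-negative and w, h positive")
    longer, shorter = max(w, h), min(w, h)
    # 1 - (longer - shorter) / longer simplifies to shorter / longer, so an
    # integer ceiling division gives the result without floating-point error.
    return -(-n_actual * shorter // longer)


if __name__ == "__main__":
    print(valid_vision_tokens(256, 1024, 1024))  # square image: nothing is padding
    print(valid_vision_tokens(256, 1000, 500))   # 2:1 image: half the tokens are valid
```

A square image keeps every token valid, while a more elongated image spends a larger share of its tokens on padding.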
Gundam-master’s resolution is too large and training it together would slow down the overall training speed.

3.3 The MoE Decoder

Our decoder uses the DeepSeekMoE [19, 20], specifically DeepSeek-3B-MoE. During inference, the model activates 6 out of 64 routed experts and 2 shared experts, with about 570M activated parameters. The 3B DeepSeekMoE is very suitable for domain-centric (OCR for us) VLM research, as it obtains the expressive capability of a 3B model while enjoying the inference efficiency of a 500M small model. The decoder reconstructs the original text representation from the com...
3 Methodology
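annotations display. We format the ground truth into an interleaved layout and text format, where each paragraph of text is preceded by its coordinates and label in the original image. All coordinates are normalized into 1000 bins.

3.4.1 OCR 1.0 data

Document data is the top priority for DeepSeek-OCR. We collect 30M pages of diverse PDF data covering about 100 languages from the Internet, with Chinese and English accounting for approximately 25M pages and other languages for approximately 5M. For this data, we create two types of ground truth: coarse annotations and fine annotations. Coarse annotations...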
[原文]tions display. We format the ground truth into an interleaved layout and text format, where each paragraph of text is preceded by the coordinates and label of it in the original image. All coordinates are normalized into 1000 bins. 3.4.1 OCR 1.0 data Document data is the top priority for DeepSeek-OCR. We collect 30M pages of diverse PDF data covering about 100 languages from the Internet, with Chinese and English accounting for approximately 25M and other languages accounting for 5M. For this data, we create two types of ground truth: coarse annotations and fine annotations. Coarse annotations...
Chinese and English. Like document OCR, natural scene OCR can also control whether to output detection boxes through prompts.

3.4.2 OCR 2.0 data

Following GOT-OCR2.0 [38], we refer to chart, chemical formula, and plane geometry parsing data as OCR 2.0 data. For chart data, following OneChart [7], we use pyecharts and matplotlib to render 10M images, mainly including commonly used line, bar, pie, and composite charts. We define chart parsing as an image-to-HTML-table conversion task, as shown in Figure LABEL:fig:demo2-1. For chemical formulas, we utilize the SMILES format from PubChem as the data ...
is not a general VLM model, and this portion of data accounts for only 20% of the total data. We introduce this type of data mainly to preserve the general vision interface, so that researchers interested in our model and general vision tasks can conveniently advance their work in the future.

3.4.4 Text-only data

To ensure the model’s language capabilities, we introduce 10% in-house text-only pretraining data, with all data processed to a length of 8192 tokens, which is also the sequence length for DeepSeek-OCR. In summary, when training DeepSeek-OCR, OCR data accounts for 70%, general vision ...
, while treating the CLIP part as the input embedding layer and placing it in PP1 with unfrozen weights for training. For the language model part, since DeepSeek3B-MoE has 12 layers, we place 6 layers each on PP2 and PP3. We use 20 nodes (each with 8 A100-40G GPUs) for training, with a data parallelism (DP) of 40 and a global batch size of 640. We use the AdamW optimizer with a step-based scheduler and an initial learning rate of 3e-5. For text-only data, the training speed is 90B tokens/day, while for multimodal data, the training speed is 70B tokens/day.

Table 2: We test DeepSeek-OCR’s vision-tex...
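The training layout figures quoted in the paragraph before the Table 2 caption (20 nodes of 8 GPUs, DP of 40, global batch size 640, 12 decoder layers split across PP2 and PP3) can be cross-checked with simple arithmetic. The sketch below assumes that data parallelism and pipeline parallelism are the only ways GPUs are partitioned, which the text does not state explicitly; the function and key names are illustrative.

```python
def training_layout(nodes: int = 20, gpus_per_node: int = 8, data_parallel: int = 40,
                    global_batch: int = 640, decoder_layers: int = 12,
                    decoder_stages: int = 2) -> dict[str, int]:
    """Derive per-replica quantities from the reported training setup."""
    total_gpus = nodes * gpus_per_node
    return {
        "total_gpus": total_gpus,
        # Assumption: GPUs not used for data parallelism form the pipeline stages.
        "gpus_per_replica": total_gpus // data_parallel,
        "batch_per_replica": global_batch // data_parallel,
        "layers_per_decoder_stage": decoder_layers // decoder_stages,
    }


if __name__ == "__main__":
    for name, value in training_layout().items():
        print(f"{name}: {value}")
```

Under that assumption each data-parallel replica spans 4 GPUs, which is consistent with a pipeline whose last two stages are PP2 and PP3 holding 6 decoder layers each.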
[原文]As shown in Figure 3 , DeepSeek-OCR enjoys a unified end-to-end VLM architecture consisting of an encoder and a decoder. The encoder (namely DeepEncoder) is responsible for extracting image features and tokenizing as well as compressing visual representations. The decoder is used for generating the required result based on image tokens and prompts. DeepEncoder is approximately 380M in parameters, mainly composed of an 80M SAM-base [ 17 ] and a 300M CLIP-large [ 29 ] connected in series. The decoder adopts a 3B MoE [ 19 , 20 ] architecture with 570M activated parameters. In the following paragr...
[原文]To explore the feasibility of contexts optical compression, we need a vision encoder with the following features: 1.Capable of processing high resolutions; 2.Low activation at high resolutions; 3.Few vision tokens; 4.Support for multiple resolution inputs; 5. Moderate parameter count. However, as described in the Section 2.1 , current open-source encoders cannot fully satisfy all these conditions. Therefore, we design a novel vision encoder ourselves, named DeepEncoder. Figure 4 : To test model performance under different compression ratios (requiring different numbers of vision tokens) and en...
[Figure 4 panels: Tiny / Small modes resize the input to W, H = 512 / 640 and produce 64 / 100 tokens. Base / Large modes pad the input to W, H = 1024 / 1280 and produce 256 / 400 tokens, of which (256 / 400)×R are valid, with R = 1−(H−W)/W. Gundam / Gundam (Master) modes combine n local tiles of 640 / 1024 with a global view of 1024 / 1280, giving n×(100 / 256) + (256 / 400) tokens, of which n×(100 / 256) + (256 / 400)×R are valid, with n ∈ [2:9]; the figure shows n = 6.]

Figure 4 | To test model performance under different compression ratios (requiring different numbers of vision tokens) and enhance the practicality of DeepSeek-OCR, we configure it with multiple resolution modes.

the 4096 tokens go through the compression module and the token cou...
input image. Dynamic resolution can be composed of two native resolutions. For example, Gundam mode consists of n 640×640 tiles (local views) and a 1024×1024 global view. The tiling method follows InternVL2.0 [8]. Supporting dynamic resolution is mainly for application considerations, especially for ultra-high-resolution inputs (such as newspaper images). Tiling is a form of secondary window attention that can effectively reduce activation memory further. It’s worth noting that due to our relatively large native resolutions, images won’t be fragmented too much un...
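The token count for the dynamic-resolution modes follows the formula given in Figure 4, n×(100 or 256) + (256 or 400). A minimal sketch, assuming the range n ∈ [2:9] is inclusive at both ends; the function name and the `master` flag are illustrative.

```python
def gundam_tokens(n_tiles: int, master: bool = False) -> int:
    """Vision tokens for n local tiles plus one global view."""
    if not 2 <= n_tiles <= 9:
        raise ValueError("Figure 4 gives n in [2:9]")
    # Gundam: 640x640 tiles (100 tokens each) + 1024x1024 global view (256 tokens).
    # Gundam-master: 1024x1024 tiles (256 each) + 1280x1280 global view (400).
    tile_tokens, global_tokens = (256, 400) if master else (100, 256)
    return n_tiles * tile_tokens + global_tokens


if __name__ == "__main__":
    for n in range(2, 10):
        print(n, gundam_tokens(n), gundam_tokens(n, master=True))
```

These are actual token counts; the number of valid tokens is lower whenever the global view is padded.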
Our decoder uses the DeepSeekMoE [19, 20], specifically DeepSeek-3B-MoE. During inference, the model activates 6 out of 64 routed experts and 2 shared experts, with about 570M activated parameters. The 3B DeepSeekMoE is very suitable for domain-centric (OCR for us) VLM research, as it obtains the expressive capability of a 3B model while enjoying the inference efficiency of a 500M small model. The decoder reconstructs the original text representation from the compressed latent vision tokens of DeepEncoder as: $f_{\text{dec}}: \mathbb{R}^{n \times d_{\text{latent}}} \to \mathbb{R}^{N \times d_{\text{text}}}; \quad \hat{\mathbf{X}} = f_{\text{dec}}(\mathbf{Z}) \quad \text{where } n \le N$ ...
We construct complex and diverse training data for DeepSeek-OCR, including OCR 1.0 data, which mainly consists of traditional OCR tasks such as scene image OCR and document OCR; OCR 2.0 data, which mainly includes parsing tasks for complex artificial images, such as common charts, chemical formulas, and plane geometry parsing data; and general vision data, which is mainly used to inject certain general image understanding capabilities into DeepSeek-OCR and preserve the general vision interface. (a) OCR 1.0 fine annotations display. We format the ground truth into an interleaved layout and text fo...
[原文]ation image-text pairs can be seen in Figure 6(a) . We also collect 3M Word data, constructing high-quality image-text pairs without layout by directly extracting content. This data mainly brings benefits to formulas and HTML-formatted tables. Additionally, we select some open-source data [ 28 , 37 ] as supplements. For natural scene OCR, our model mainly supports Chinese and English. The image data sources come from LAION [ 31 ] and Wukong [ 13 ] , labeled using PaddleOCR [ 9 ] , with 10M data samples each for Chinese and English. Like document OCR, natural scene OCR can also control whether ...
[原文]such as line segments, endpoint coordinates, line segment types, etc., for better readability. Each line segment is encoded using the Slow Perception [ 39 ] manner. 3.4.3 General vision data DeepEncoder can benefit from CLIP’s pretraining gains and has sufficient parameters to incorporate general visual knowledge. Therefore, we also prepare some corresponding data for DeepSeek-OCR. Following DeepSeek-VL2 [ 40 ] , we generate relevant data for tasks such as caption, detection, and grounding. Note that DeepSeek-OCR is not a general VLM model, and this portion of data accounts for only 20% of the...
[原文]Our training pipeline is very simple and consists mainly of two stages: a).Training DeepEncoder independently; b).Training the DeepSeek-OCR. Note that the Gundam-master mode is obtained by continuing training on a pre-trained DeepSeek-OCR model with 6M sampled data. Since the training protocol is identical to other modes, we omit the detailed description hereafter. 3.5.1 Training DeepEncoder Following Vary [ 36 ] , we utilize a compact language model [ 15 ] and use the next token prediction framework to train DeepEncoder. In this stage, we use all OCR 1.0 and 2.0 data aforementioned, as well a...
encoder.

Table 4: Edit distances for different categories of documents in OmniDocBench. The results show that some types of documents can achieve good performance with just 64 or 100 vision tokens, while others require Gundam mode.

| Mode / Type | Book | Slides | Financial Report | Textbook | Exam Paper | Magazine | Academic Papers | Notes | Newspaper | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Tiny | 0.147 | 0.116 | 0.207 | 0.173 | 0.294 | 0.201 | 0.395 | 0.297 | 0.94 | 0.32 |
| Small | 0.085 | 0.111 | 0.079 | 0.147 | 0.171 | 0.107 | 0.131 | 0.187 | 0.744 | 0.205 |
| Base | 0.037 | 0.08 | 0.027 | 0.1 | 0.13 | 0.073 | 0.052 | 0.176 | 0.645 | 0.156 |
| Large | 0.038 | 0.108 | 0.022 | 0.084 | 0.109 | 0.06 | 0.053 | 0.155 | 0.353 | 0.117 |

Gunda...
o does not exceed 10×. For newspapers, Gundam or even Gundam-master mode is required to achieve acceptable edit distances, because newspapers contain 4,000-5,000 text tokens, far exceeding the 10× compression of other modes. These experimental results further demonstrate the boundaries of contexts optical compression, which may provide effective references for research on vision token optimization in VLMs and on context compression and forgetting mechanisms in LLMs.

4.3 Qualitative Study

4.3.1 Deep parsing

DeepSeek-OCR possesses both layout and OCR 2.0 capabilities, enabling it...
[原文]DF data on the Internet contains not only Chinese and English, but also a large amount of multilingual data, which is also crucial when training LLMs. For PDF documents, DeepSeek-OCR can handle nearly 100 languages. Like Chinese and English documents, multilingual data also supports both layout and non-layout OCR formats. The visualization results are shown in Figure 10 , where we select Arabic and Sinhala languages to demonstrate results. Figure 10 : To endow the capability of processing widely crawled PDFs (multilingual data), we train our model with OCR capabilities for nearly 100 languages...
for different categories of documents in OmniDocBench. The results show that some types of documents can achieve good performance with just 64 or 100 vision tokens, while others require Gundam mode.

| Mode / Type | Book | Slides | Financial Report | Textbook | Exam Paper | Magazine | Academic Papers | Notes | Newspaper | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Tiny | 0.147 | 0.116 | 0.207 | 0.173 | 0.294 | 0.201 | 0.395 | 0.297 | 0.94 | 0.32 |
| Small | 0.085 | 0.111 | 0.079 | 0.147 | 0.171 | 0.107 | 0.131 | 0.187 | 0.744 | 0.205 |
| Base | 0.037 | 0.08 | 0.027 | 0.1 | 0.13 | 0.073 | 0.052 | 0.176 | 0.645 | 0.156 |
| Large | 0.038 | 0.108 | 0.022 | 0.084 | 0.109 | 0.06 | 0.053 | 0.155 | 0.353 | 0.117 |

Gundam 0.035 0.085 0.289 0.095 0.094 0...
DeepSeek-OCR is not only an experimental model; it has strong practical capabilities and can construct data for LLM/VLM pretraining. To quantify OCR performance, we test DeepSeek-OCR on OmniDocBench [27], with results shown in Table 3. Requiring only 100 vision tokens (640×640 resolution), DeepSeek-OCR surpasses GOT-OCR2.0 [38], which uses 256 tokens; with 400 tokens (285 valid tokens, 1280×1280 resolution), it achieves on-par performance with the state of the art on this benchmark. Using fewer than 800 tokens (Gundam mode), DeepSeek-OCR outperforms MinerU2.0 [34] which n...
[原文]PDFs (multilingual data), we train our model with OCR capabilities for nearly 100 languages. Minority language documents can also support both layout and non-layout outputs through different prompts. 4.3.3 General vision understanding We also provide DeepSeek-OCR with a certain degree of general image understanding capabilities. The related visualization results are shown in Figure 11 . Figure 11 : We retain DeepSeek-OCR’s capabilities in general visual understanding, mainly including image description, object detection, grounding, etc. Meanwhile, due to the inclusion of text-only data, DeepSe...
5 Discussion
Our work is an initial exploration of the boundaries of vision-text compression, investigating how many vision tokens are required to decode N text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at a ratio of approximately 10×, while 20× compression still retains about 60% accuracy. These findings point to directions for future applications, such as applying optical processing to dialogue histories beyond k rounds in multi-turn conversations to achieve 10× compression efficiency. For older contexts...
Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode N text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy. These findings suggest promising directions for future applications, such as implementing optical processing for dialogue histories beyond k rounds in multi-turn conversations to achieve 10× compression efficiency. For older contexts,...
[原文]In this technical report, we propose DeepSeek-OCR and preliminarily validate the feasibility of contexts optical compression through this model, demonstrating that the model can effectively decode text tokens exceeding 10 times the quantity from a small number of vision tokens. We believe this finding will facilitate the development of VLMs and LLMs in the future. Additionally, DeepSeek-OCR is a highly practical model capable of large-scale pretraining data production, serving as an indispensable assistant for LLMs. Of course, OCR alone is insufficient to fully validate true context optical co...