
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding


📄 arXiv: 2412.10302 · 📅 2024-12-12 · PDF

Summary

DeepSeek-VL2 applies a Mixture-of-Experts (MoE) architecture to vision-language models, supporting high-resolution image understanding and complex visual reasoning. Through a dynamic tiling strategy, visual-token compression, and sparse expert routing, it substantially reduces computational cost while maintaining strong performance, and it markedly surpasses prior work on document understanding, chart analysis, and scientific-figure understanding.

Abstract

Zhiyu Wu*, Xiaokang Chen*, Zizheng Pan*, Xingchao Liu*, Wen Liu*†, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan‡ (DeepSeek-AI)

Original: We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy…


1 Introduction

[Introduction] Background, motivation, and main contributions of DeepSeek-VL2.

Original: Large Vision-Language Models (VLMs) have emerged as a transformative force in artificial intelligence [54, 59, 94, 63, 83, 15, 88], extending the remarkable capabilities of Large Language Models (LLMs) to seamlessly process both visual and textual information. This advancement has dramatically expanded the potential for AI systems to tackle complex real-world applications that require multimodal understanding. In this technical report, we present DeepSeek-VL2, a new series of open-source Vision-Language Models that leverages the Mixture-of-Experts (MoE) architecture to achieve substantial improvements in both performance and efficiency compared to its predecessor, DeepSeek-VL [59]. …

Original: …the language model. This design preserves the advantages of vision transformers with local attention, enabling rich feature extraction without the quadratic computational scaling typically associated with increasing image resolutions. For the language component, we leverage DeepSeek language models [20, 53], featuring the Multi-head Latent Attention (MLA) mechanism. MLA significantly reduces computational cost by compressing the Key-Value (KV) cache into a latent vector, resulting in faster inference and increased throughput capacity. We further enhance efficiency through the DeepSeekMoE framework [20, 86], which employs sparse computation techniques. Our model series adopts three MoE variants, …

2 Model Architecture

[Architecture] The model architecture and technical details of DeepSeek-VL2.

Original: DeepSeek-VL2 consists of three core modules: (1) a vision encoder, (2) a vision-language adaptor, and (3) a Mixture-of-Experts language model. Building upon the decoder-only, LLaVA-style [54] architecture of its predecessor, DeepSeek-VL2 introduces two major advancements: a dynamic tiling strategy and a DeepSeekMoE [20, 86] language model featuring Multi-head Latent Attention [53]. These innovations enable more efficient processing of both high-resolution visual inputs and text data.

Dynamic Tiling Strategy. The original DeepSeek-VL employed a hybrid vision encoder combining SigLIP [106] for coarse-grained feature extraction at 384×384 resolution and SAM-B [35] for fine-grained feature extraction at 1024×1024 resolution. While this fusion approach generated rich visual…

Original: …We first resize the original image until its long side matches the target resolution, padding the other dimension while maintaining the original aspect ratio, for each candidate resolution in C_R. We select the resolution (m_i·384, n_i·384) that minimizes the padding area. The resized image is then divided into m_i × n_i local tiles of 384×384 pixels, plus one global thumbnail tile. The SigLIP-SO400M-384 vision encoder processes all (1 + m_i×n_i) tiles, yielding 27×27 = 729 visual embeddings of 1152 dimensions per tile. For computational efficiency and context-length management, we disable the dynamic tiling strategy when processing multiple (>2) images.
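The tile-selection rule described above is easy to sketch. The following is a minimal illustration under stated assumptions: the default cap m·n ≤ 9 and the helper names are ours, not the paper's (Section 5.1 does note that the cap is raised to m·n ≤ 18 when evaluating InfoVQA).

```python
TILE = 384  # SigLIP-SO400M-384 input resolution

def candidate_resolutions(max_tiles=9):
    """C_R = {(m*384, n*384) | m, n >= 1 and m*n <= max_tiles}.
    max_tiles=9 is an assumed default; the paper raises it to 18 for InfoVQA."""
    return [(m * TILE, n * TILE)
            for m in range(1, max_tiles + 1)
            for n in range(1, max_tiles + 1)
            if m * n <= max_tiles]

def select_resolution(height, width, max_tiles=9):
    """Pick the candidate (m*384, n*384) that minimizes padding after an
    aspect-ratio-preserving resize of a (height, width) image."""
    best, best_pad = None, None
    for ch, cw in candidate_resolutions(max_tiles):
        scale = min(ch / height, cw / width)          # fit inside the candidate
        pad = ch * cw - (height * scale) * (width * scale)  # wasted area
        if best_pad is None or pad < best_pad:
            best, best_pad = (ch, cw), pad
    return best

ch, cw = select_resolution(1080, 1920)
m, n = ch // TILE, cw // TILE
# 1 global thumbnail + m*n local tiles, each yielding 27x27 = 729 embeddings
print(f"{m}x{n} grid -> {1 + m * n} tiles, {(1 + m * n) * 729} embeddings")
```

Minimizing padding, rather than allowing distortion, keeps every tile at the native 384×384 resolution the encoder was trained on.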

Original: …×(n_i·14+1) visual tokens, which are subsequently projected into the language model's embedding space using a two-layer multilayer perceptron (MLP). A visual illustration of our dynamic tiling strategy is shown in Figure 3.

DeepSeekMoE LLM. Our language model is based on DeepSeekMoE [20, 86], which incorporates the Multi-head Latent Attention mechanism [53]. MLA enhances inference efficiency by compressing the Key-Value cache into a latent vector, enabling increased throughput capacity. The model also incorporates a MoE architecture [20] allowing for efficient inference through sparse computation. During MoE training, we introduce a global bias term [86] for each expert to cost-effectively improve load balancing between experts. DeepSeek-VL2 comes in three variants with the following model sizes…
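The per-expert global bias admits a compact sketch in the spirit of DeepSeek's auxiliary-loss-free load balancing [86]. The top-k value, the sign-based update rule, and the step size below are assumptions; the excerpt does not specify them.

```python
import numpy as np

def route(scores, bias, k=6):
    """Select top-k experts per token using bias-adjusted affinities.
    The bias influences only *which* experts are selected; the mixing
    weights for expert outputs would come from the raw scores."""
    adjusted = scores + bias                      # (tokens, experts)
    return np.argsort(-adjusted, axis=-1)[:, :k]  # top-k expert ids

def update_bias(bias, topk, n_experts, step=1e-3):
    """Nudge under-loaded experts up and over-loaded experts down --
    a cheap alternative to an auxiliary balancing loss."""
    load = np.bincount(topk.ravel(), minlength=n_experts)
    target = topk.size / n_experts                # ideal uniform load
    return bias - step * np.sign(load - target)

rng = np.random.default_rng(0)
scores = rng.normal(size=(1024, 64))  # toy token-to-expert affinities
bias = np.zeros(64)
for _ in range(100):                  # bias converges toward balanced load
    topk = route(scores, bias)
    bias = update_bias(bias, topk, 64)
```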

3 Data Construction

[Data/Training] Construction of DeepSeek-VL2's training data across the three training stages.

Original: We build a comprehensive Vision-Language dataset from diverse sources for DeepSeek-VL2. The training process is structured into three distinct stages: (1) VL alignment, (2) VL pretraining, and (3) supervised fine-tuning (SFT). In the following parts, we provide descriptions of the data used in each stage.

3.1 Vision-Language Alignment Data

Original: The alignment stage focuses on training the MLP connector to bridge the pretrained visual encoder and the LLM. For this initial warmup phase, we utilize ShareGPT4V [12], a dataset containing approximately 1.2M caption and conversation samples.

3.2 Vision-Language Pretraining Data

Original: Following DeepSeek-VL [59], our pretraining data combines vision-language (VL) and text-only data to maintain a balance between VL capabilities and text-only performance. For DeepSeek-VL2, we maintain a ratio of around 70% VL data to 30% text-only data, with the latter sourced directly from our base LLM pretraining corpus. In the following, we categorize the VL data into several groups and describe their details.

Interleaved image-text data. Our data collection begins with several open-sourced datasets, including WIT [79], WikiHow [38], and 30% random samples from OBELICS [41]. …

Original: …problematic cases with brief descriptions, mismatched text pairs, or obvious hallucinations. To address these quality inconsistencies, we developed a comprehensive image captioning pipeline that considers: (1) OCR hints, (2) meta information (e.g., location, camera settings), and (3) relevant original captions as prompts. Using an in-house captioner, we recaption the images following prompting strategies similar to PixelProse [78], employing varied instructions to guide the VLM's caption generation. Despite the overall improvement in caption quality, we observed repetition issues in the larg…

Original: …es. To mitigate this, we implemented a quality control pipeline using DeepSeek Chat [53] to score all captions simply based on their writing quality. In practice, this approach is both efficient and effective in filtering out low-quality captions.

Optical character recognition data. To develop OCR capabilities, we used open-source datasets including LaTeX OCR [7] and 12M RenderedText [93]. We combined these datasets with an extensive in-house OCR dataset covering diverse document types. Currently, our in-house dataset mainly focuses on English and Chinese character recognition. We plan…
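The recaptioning pipeline above conditions an in-house captioner on three kinds of hints. A minimal sketch of how such a prompt might be assembled is below; the field names, wording, and the `build_recaption_prompt` helper are illustrative assumptions, not the paper's actual template.

```python
def build_recaption_prompt(ocr_text, meta, original_caption, instruction):
    """Combine the three hint sources described above into one prompt.
    All field names and phrasings here are hypothetical."""
    parts = [instruction]
    if ocr_text:
        parts.append(f"Text visible in the image (OCR): {ocr_text}")
    if meta:
        parts.append("Image metadata: " +
                     "; ".join(f"{k}={v}" for k, v in meta.items()))
    if original_caption:
        parts.append(f"Original caption (may be noisy): {original_caption}")
    return "\n".join(parts)

prompt = build_recaption_prompt(
    ocr_text="GRAND OPENING  50% OFF",
    meta={"location": "Osaka", "camera": "f/1.8, 1/250s"},
    original_caption="store front",
    instruction="Describe the image in detail.",
)
print(prompt)
```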

Original: …Websight using DeepSeek V2.5. We also exploit Python plot code generated by DeepSeek V2.5 to mitigate the noise in the plot-to-code data.

• QA with visual prompt. We follow [9] to construct visual prompt understanding data by overlaying various visual indicators (arrows, boxes, circles, and scribbles) onto images from [9, 89, 90]. We then create QA pairs focusing on objects highlighted by these visual prompts.

Visual grounding data. We construct our visual grounding dataset from [71, 75]. For each image's object detection annotations, we structure the data as follows:

• Prompt: Locate <|ref|><|/ref|> in the given image.
• Response: <|ref|><|/ref|><|det|>[[x1, y1, x2, y2], …]<|/det|>

The question prompts are randomly sampled from a candidate pool during training. <|ref|>, <|/ref|>, <|det|>, and <|/det|> are special tokens. The placeholder between <|ref|> and <|/ref|> is filled with either the category name (e.g., "car") or a description of the object (e.g., "the leftmost person"). [[x1, y1, x2, y2], …] is a list of bounding boxes, where each bounding box corresponds to an object's position. The coordinates x1, y1 and x2, y2 specify the top-left and bottom-right corners respectively, normalized to…
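To make the format concrete, here is a small sketch that assembles one grounding training pair. The prompt pool beyond the single quoted wording and the 0–999 integer normalization range (the excerpt truncates at "normalized to…") are assumptions.

```python
import random

# Special tokens from the grounding format quoted above.
REF_O, REF_C, DET_O, DET_C = "<|ref|>", "<|/ref|>", "<|det|>", "<|/det|>"

# Stand-in for the "candidate pool" of question prompts; only the first
# wording is quoted in the text, the second is invented for illustration.
PROMPT_POOL = [
    "Locate {ref} in the given image.",
    "Find {ref} in the image.",
]

def norm_box(box, w, h, scale=999):
    """Normalize pixel coordinates; the 0-999 range is an assumption."""
    x1, y1, x2, y2 = box
    return [round(x1 / w * scale), round(y1 / h * scale),
            round(x2 / w * scale), round(y2 / h * scale)]

def grounding_sample(query, boxes, w, h):
    """Build one (prompt, response) pair for a query and its boxes."""
    ref = f"{REF_O}{query}{REF_C}"
    prompt = random.choice(PROMPT_POOL).format(ref=ref)
    dets = ", ".join(str(norm_box(b, w, h)) for b in boxes)
    response = f"{ref}{DET_O}[{dets}]{DET_C}"
    return prompt, response

print(grounding_sample("the leftmost person", [(40, 120, 210, 460)], 640, 480))
```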

3.3 Supervised Fine-tuning Data

Original: Our SFT data combines a diverse collection of open-sourced datasets with high-quality in-house QA pairs. Below, we detail our efforts to enhance the quality of our SFT dataset.

General visual question-answering. While public visual QA datasets are diverse [74, 10, 43, 9, 27, 31, 47], they often suffer from three main limitations: (1) short responses, (2) poor OCR quality, and (3) hallucinated content. To address these issues, we regenerate responses by jointly considering the original questions, images, and OCR information. Our experiments demonstrate that this approach produces more comprehensive and accurate results. During development, we observed that…

Original: …improves document-based interactions.

Table and chart understanding. We enhanced table-based QA data by regenerating responses for all public datasets [14, 49] based on their original questions, except Cauldron [43], which already exhibits high quality. Similar to the OCR capabilities developed during VL pretraining, our model demonstrated strong performance in chart understanding without requiring additional effort.

Reasoning, logic, and mathematics. We enhance public reasoning-focused datasets [76, 43, 61, 17, 102, 109] with more detailed reasoning processes and standardize response formats so that the final answer is placed at the end of the response. We obse…

Original: …<|ref|>, <|/ref|>, <|det|>, <|/det|> are special tokens. The first placeholder represents phrases like "an object within the red bounding box", while the second is the model's description of the detected object (e.g., "cat").

Grounded conversation. We construct our grounded conversation data using [62, 72] to further enhance the model's capabilities established during the pretraining phase.

Text-only datasets. To maintain the language ability of the model, we also use text-only instruction-tuning datasets [98, 4, 18, 68, 91, 70, 84, 6, 19] during the SFT stage.

4 Training Methodology


4.1 Training Pipelines

Original: DeepSeek-VL2 is trained through a three-stage pipeline: (1) an initial stage where we train the vision encoder and vision-language adaptor MLP while keeping the language model fixed, using image-text paired data detailed in Section 3.1, (2) a pretraining stage where we conduct vision-language pre-training using the data described in Section 3.2, and (3) a fine-tuning stage where we perform supervised fine-tuning with the data outlined in Section 3.3. In both the pretraining and fine-tuning stages, all model parameters, including the vision encoder, vision-language adaptor, and language model…

Original (Table 2, training hyperparameters; the excerpt repeats each row across the three model variants with identical values except the learning rates, which are partially lost in extraction — visible values: 4.2×10⁻⁴, 4.5×10⁻⁴, 1.4×10⁻⁵, 2×10⁻⁵):

| Hyperparameter | Stage 1 | Stage 2 | Stage 3 |
| Learning rate | … | … | … |
| Visual encoder LR multiplier | 0.1 | 0.1 | 0.1 |
| Fix language model | ✓ | × | × |
| LR scheduler | Cosine | Step | Constant |
| Weight decay | 0.1 | 0.1 | 0.1 |
| Gradient clip | 1.0 | 1.0 | 1.0 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) | AdamW (β₁=0.9, β₂=0.95) | AdamW (β₁=0.9, β₂=0.95) |

Original: …[in-]house vision-language SFT data, we optimize all parameters while supervising only the answers and special tokens, masking both system and user prompts. To strengthen dialogue comprehension, we combine multimodal data with the pure-text dialogue data from DeepSeek-V2 [53]. This approach ensures robust performance across diverse vision-language tasks, including dense image captioning, general VQA, OCR, table/chart/document/figure understanding, visual-to-code, visual reasoning, visual grounding, and language understanding.
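The masking rule above ("supervise only the answers and special tokens") is a standard label-masking scheme. A minimal sketch follows; the role names, the toy token ids, and the `build_labels` helper are illustrative, with -100 as the conventional ignore index for cross-entropy losses.

```python
def build_labels(token_ids, roles, special_token_set, ignore_index=-100):
    """Keep the loss only on assistant answers and special tokens;
    system and user prompt positions are masked with ignore_index."""
    labels = []
    for tok, role in zip(token_ids, roles):
        if role == "assistant" or tok in special_token_set:
            labels.append(tok)           # contributes to the loss
        else:
            labels.append(ignore_index)  # masked: system/user prompt
    return labels

ids   = [1, 501, 502, 9000, 601, 602, 2]   # toy token ids
roles = ["system", "user", "user", "special",
         "assistant", "assistant", "assistant"]
print(build_labels(ids, roles, special_token_set={9000}))
# -> [-100, -100, -100, 9000, 601, 602, 2]
```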

Original (Table 3, comparison with state-of-the-art models on OCR-related multimodal benchmarks; excerpt — earlier rows were truncated; † marks activated parameters of MoE models):

| Model | LLM params (act.) | Vision params | Total (act.) | DocVQA | ChartQA | InfoVQA | TextVQA | OCRBench |
| InternVL2-8B [16] | 7.7B | 0.3B | 8.0B | 91.6 | 83.3 | 74.8 | 77.4 | 794 |
| Qwen2-VL-7B [88] | 7.6B | 0.7B | 8.3B | 94.5 | 83.0 | 76.5 | 84.3 | 845 |
| Pixtral-12B [3] | 12.0B | 0.4B | 12.4B | 90.7 | 81.8 (CoT) | 50.8 | 75.7 | – |
| DeepSeek-VL 7B [59] | 6.9B | 0.4B | 7.3B | – | – | – | – | 456 |
| DeepSeek-VL2 | 4.1B† | 0.4B | 4.5B† | 93.3 | 86.0 | 78.1 | 84.2 | 811 |

4.2 Hyperparameters and Infrastructures

Original: Detailed hyperparameters for DeepSeek-VL2 training are listed in Table 2. We conducted our training and evaluation using HAI-LLM [30], an efficient and lightweight platform designed for large models. A significant challenge in our pipeline-parallel strategy arose from the vision encoder's unique computational characteristics compared to LLM blocks. As the first component in the model pipeline, the vision encoder requires careful load balancing across GPUs to prevent pipeline bubbles and optimize GPU utilization. To address this, we implemented fine-grained layer division of the vision encoder…
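As an illustration of the load-balancing concern, here is a toy greedy split of model blocks across pipeline stages. The cost model and the `assign_stages` helper are our assumptions, not HAI-LLM's actual algorithm; the point is only that cheap encoder layers and heavier LLM blocks get grouped so each stage carries roughly equal work.

```python
def assign_stages(block_costs, n_stages):
    """Greedy front-to-back split of a block list (vision encoder layers
    followed by LLM blocks) into pipeline stages of roughly equal cost."""
    total = sum(block_costs)
    per_stage = total / n_stages
    stages, current, acc = [], [], 0.0
    for i, cost in enumerate(block_costs):
        current.append(i)
        acc += cost
        # close the stage once cumulative cost reaches its share,
        # keeping at least one block for every remaining stage
        if acc >= per_stage * (len(stages) + 1) and len(stages) < n_stages - 1:
            stages.append(current)
            current = []
    stages.append(current)
    return stages

# Toy costs: 27 cheap SigLIP layers followed by 30 heavier MoE blocks.
costs = [1.0] * 27 + [2.5] * 30
for s, blocks in enumerate(assign_stages(costs, 4)):
    print(f"stage {s}: blocks {blocks[0]}..{blocks[-1]}")
```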

5 Evaluation


5.1 Multimodal Performance

Original: Benchmarks. We perform a holistic evaluation of DeepSeek-VL2 across a collection of commonly used benchmarks, including DocVQA [66], ChartQA [65], InfoVQA [67] (because InfoVQA contains images with extreme aspect ratios and excessively large images, we enlarge the candidate resolutions to C_R = {(m·384, n·384) | m ∈ ℕ, n ∈ ℕ, 1 ≤ m, n, mn ≤ 18} when evaluating it), TextVQA [77], RealWorldQA [95], OCRBench [57], AI2D [34], MMMU…

Original (general QA and math benchmark table; excerpt — the header row and one leading row were lost in extraction, so the ten score columns are unlabeled; the fourth score column is an MME total):

Open-source models (1B–4B):
| MiniCPM-V 2.0 [99] | 2.8B | – | – | 38.2 | 1,809 | 69.6 | 68.1 | – | – | – | 38.7 |
| InternVL2-2B [16] | 2.2B | 49.8 | 74.1 | 36.3 | 1,877 | 73.2 | 70.9 | 69.6 | 50.4 | 57.3 | 46.3 |
| Qwen2-VL-2B [88] | 2.2B | 48.0 | 74.4 | 41.1 | 1,872 | 74.9 | 73.5 | 72.2 | 54.5 | 62.9 | 47.8 |
| MM1.5-3B [107] | 3B | – | 65.7 | 37.1 | 1,798 | – | – | – | – | 56.9 | 44.4 |
| DeepSeek-VL2-Small | 2.8B† | 57.0 | 80.0 | 48.0 | 2,123 | 82.3 | 80.3 | 79.3 | 62.9 | 65.4 | 60.7 |

Open-source models (4B–13B):
| Phi-3.5-Vision [1] | 4.1B | 47.5 | 78.1 | 43.0 | – | 76.0 | 66.1 | 72.1 | 53.6 | 53.6 | 43.9 |
| InternVL2-4B [16] | 4.1B | 54.3 | 78.9 | 47.9 | 2,060 | 78.6 | 73.9 | 75.8 | 55.7 | 60.7 | 58.6 |
| Aria-MoE [46] | 4.3B† | – | – | 54.9 | – | – | – | – | – | – | 66.1 |
| MM1.5-7B [107] | 7B | … |

Original (visual grounding benchmarks; the eight score columns match the standard RefCOCO val/testA/testB, RefCOCO+ val/testA/testB, RefCOCOg val/test layout):

| Model | RefCOCO val | testA | testB | RefCOCO+ val | testA | testB | RefCOCOg val | test |
| DeepSeek-VL2-Tiny | 84.7 | 87.8 | 78.4 | 75.9 | 83.9 | 67.4 | 73.8 | 83.9 |
| InternVL2-2B [16] | 82.3 | 88.2 | 75.9 | 73.5 | 82.8 | 63.3 | 77.6 | 78.3 |
| DeepSeek-VL2-Small | 93.9 | 95.3 | 91.3 | 89.4 | 92.9 | 84.8 | 92.6 | 92.6 |

Open-source VLM (4B–9B):
| Shikra-7B [11] | 87.0 | 90.6 | 80.2 | 81.6 | 87.4 | 72.1 | 82.3 | 82.2 |
| TextHawk2-7B [103] | 91.9 | 93.0 | 87.6 | 86.2 | 90.0 | 80.4 | 88.2 | 88.1 |
| Ferret-v2-7B [108] | 92.8 | 94.7 | 88.7 | 87.4 | 92.8 | 79.3 | 89.4 | 89.3 |
| InternVL2-8B [16] | 87.1 | 91.1 | 80.7 | 79.8 | 87.9 | 71.4 | 82.7 | 82.7 |
| MM1.5-7B [107] | – | 92.5 | 86.7 | – | 88.7 | 77.8 | – | 87.1 |
| Qwen2-VL-7B [88] | 91.7 | 93.6 | 87.3 | 85.8 | 90.5 | 79.5 | 87.3 | 87.8 |
| DeepSeek-VL2 | 95.1 | 96.7 | 92.7 | 91.2 | 94.9 | 87.4 | 92.8 | 92.9 |


5.2 Qualitative Study

Original: In this section, we demonstrate different capabilities of DeepSeek-VL2, ranging from general question answering to visual storytelling and visual grounding.

General visual question answering. Benefiting from our new VL pretraining dataset and diverse SFT data, DeepSeek-VL2 demonstrates significantly improved ability on general visual question answering, as shown in Figure 4. Overall, the model excels at dense image description, and it is able to recognize common landmarks, general visual knowledge, and rich text in both English and Chinese. It also performs favorably on chart understanding…

Original (figure caption): …storytelling capability of DeepSeek-VL2. Our model can accept multiple images as input and narrate a story in either Chinese or English based on the images.

Original: …and varied plot types (e.g., happy or tragic endings), which may inherently conflict with the safety requirements in LLM/VLM research. We aim to explore solutions to broaden the scope of storytelling while considering these challenges.

Figure 8: Visual grounding ability of DeepSeek-VL2. Our model can locate objects based on their category names, descriptions, or some abstract concepts.

Figure 9: Grounded conversation with DeepSeek-VL2. Our model can perform reasoning on images while identifying the locations of relevant objects, thereby enabling the possibility of interacting with the real world.

Figure 10: In-context visual grounding with DeepSeek-VL2. Given one image, either with or without visual prompts, DeepSeek-VL2 is able to find relevant objects…

Original: …relevant objects with accurate locations in its response, as demonstrated in Figure 9. This enables the model to interact better with the real world, thereby creating opportunities to play a greater role in fields such as embodied AI and computer/phone agents.

6 Conclusion

Original: In this technical report, we introduce DeepSeek-VL2, an enhanced series of MoE-based Vision-Language Models, available at scales of 3B, 16B, and 27B total parameters, with corresponding activated parameters of 1.0B, 2.8B, and 4.5B. This configuration facilitates efficient computation during both training and inference. Notably, our 3B, 16B, and 27B models can be deployed on a single GPU with 10 GB, 40 GB, and 80 GB of memory, respectively. We employ a dynamic tiling vision encoding strategy to efficiently process high-resolution images with various aspect ratios. By making code…