DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

arXiv: 2412.10302 | 2024-12-12

Summary

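DeepSeek-VL2 applies the Mixture-of-Experts (MoE) architecture to vision-language models, supporting ultra-high-resolution image understanding and complex visual reasoning. It combines visual token compression with dynamic expert routing, substantially reducing compute cost while maintaining high performance. It significantly outperforms prior work on tasks such as document understanding, chart analysis, and scientific figure understanding.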

[Abstract] This paper presents the architecture, training methodology, and experimental results of DeepSeek-VL2.

Original: Zhiyu Wu∗, Xiaokang Chen∗, Zizheng Pan∗, Xingchao Liu∗, Wen Liu∗,†, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan‡. DeepSeek-AI. Abstract: We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strateg...

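(See the original text for details.) Through innovations in architecture design and training methods, the DeepSeek team has made notable progress in this area. The model performs well on the relevant benchmarks, which supports the effectiveness of the approach. The work is a significant contribution to the open-source AI community and advances the technology. The related techniques will continue to be optimized and improved.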

Original: ...9, 94, 63, 83, 15, 88], extending the remarkable capabilities of Large Language Models (LLMs) to seamlessly process both visual and textual information. This advancement has dramatically expanded the potential for AI systems to tackle complex real-world applications that require multimodal understanding. In this technical report, we present DeepSeek-VL2, a new series of open-source Vision-Language Models that leverages the Mixture-of-Experts (MoE) architecture to achieve substantial improvements in both performance and efficiency compared to its predecessor, DeepSeek-VL [59]. Our adv...

Original: ...rich feature extraction without the quadratic computational scaling typically associated with increasing image resolutions. For the language component, we leverage DeepSeek language models [20, 53], featuring the Multi-head Latent Attention (MLA) mechanism. MLA significantly reduces computational cost by compressing the Key-Value (KV) cache into a latent vector, resulting in faster inference and increased throughput capacity. We further enhance efficiency through the DeepSeekMoE framework [20, 86], which employs sparse computation techniques. Our model series adopt three MoE variants, ...

Original: ...predecessor, DeepSeek-VL2 introduces two major advancements: a dynamic tiling strategy and a DeepSeekMoE [20, 86] language model featuring Multi-head Latent Attention [53]. These innovations enable more efficient processing of both high-resolution visual inputs and text data. Dynamic Tiling Strategy. The original DeepSeek-VL employed a hybrid vision encoder combining SigLIP [106] for coarse-grained feature extraction at 384 × 384 resolution and SAM-B [35] for fine-grained feature extraction at 1024 × 1024 resolution. While this fusion approach generated rich visua...

Original: ...resolution (m_i · 384, n_i · 384) that minimizes the padding area. The resized image is then divided into m_i × n_i local tiles of 384 × 384 pixels, plus one global thumbnail tile. The SigLIP-SO400M-384 vision encoder processes all (1 + m_i × n_i) tiles, yielding 27 × 27 = 729 visual embeddings of 1152 dimensions per tile. For computational efficiency and context length management, we disable the dynamic tiling strategy when processing multiple (> 2) images. Figure 3: Illustrat...

Original: ...Figure 3. DeepSeekMoE LLM. Our language model is based on DeepSeekMoE [20, 86], which incorporates the Multi-head Latent Attention mechanism [53]. MLA enhances inference efficiency by compressing the Key-Value cache into a latent vector, enabling increased throughput capacity. The model also incorporates a MoE architecture [20] allowing for efficient inference through sparse computation. During MoE training, we introduce a global bias term [86] for each expert to cost-effectively improve load balancing between experts. DeepSeek-VL2 comes in three variants with the following model s...

1 Introduction

[Introduction] Research background, motivation, and main contributions of DeepSeek-VL2.

Original: Large Vision-Language Models (VLMs) have emerged as a transformative force in artificial intelligence [54, 59, 94, 63, 83, 15, 88], extending the remarkable capabilities of Large Language Models (LLMs) to seamlessly process both visual and textual information. This advancement has dramatically expanded the potential for AI systems to tackle complex real-world applications that require multimodal understanding. In this technical report, we present DeepSeek-VL2, a new series of open-source Vision-Language Models that leverages the Mixture-of-Experts (MoE) architecture to achieve substan...

Original: ...the language model. This design preserves the advantages of vision transformers with local attention, enabling rich feature extraction without the quadratic computational scaling typically associated with increasing image resolutions. For the language component, we leverage DeepSeek language models [20, 53], featuring the Multi-head Latent Attention (MLA) mechanism. MLA significantly reduces computational cost by compressing the Key-Value (KV) cache into a latent vector, resulting in faster inference and increased throughput capacity. We further enhance efficiency through the DeepSeekMoE f...
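To make the KV-cache claim concrete, the following back-of-the-envelope sketch compares how many values are cached per token per layer under standard multi-head attention and under a single latent vector. The head count, head dimension, latent dimension, and the optional `rope_dim` term are hypothetical illustration values, not figures taken from this excerpt.

```python
def mha_cache_per_token(num_heads: int, head_dim: int) -> int:
    """Values cached per token per layer by standard multi-head attention:
    one key vector and one value vector for every head."""
    return 2 * num_heads * head_dim


def latent_cache_per_token(latent_dim: int, rope_dim: int = 0) -> int:
    """Values cached per token per layer when keys and values are rebuilt
    from one shared latent vector. rope_dim is an assumed extra slot for a
    separately cached positional key; leave it at 0 to ignore it."""
    return latent_dim + rope_dim


# Hypothetical sizes chosen only for illustration.
baseline = mha_cache_per_token(num_heads=32, head_dim=128)
compressed = latent_cache_per_token(latent_dim=512, rope_dim=64)
reduction = baseline / compressed
```

Under these assumed sizes the latent cache is more than an order of magnitude smaller per token, which is the mechanism behind the throughput gain the text describes.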

2 Model Architecture

[Architecture] Model architecture design and technical details of DeepSeek-VL2.

Original: DeepSeek-VL2 consists of three core modules: (1) a vision encoder, (2) a vision-language adaptor, and (3) a Mixture-of-Experts language model. Building upon the decoder-only LLaVA-style [54] architecture of its predecessor, DeepSeek-VL2 introduces two major advancements: a dynamic tiling strategy and a DeepSeekMoE [20, 86] language model featuring Multi-head Latent Attention [53]. These innovations enable more efficient processing of both high-resolution visual inputs and text data. Dynamic Tiling Strategy. The original DeepSeek-VL employed a hybrid vision encoder combining SigLIP [10...

Original: [Footnote: We first resize the original image until its long side matches the target resolution, then pad the other dimension while maintaining the original aspect ratio.] ...it to each candidate resolution in C_R. We select the resolution (m_i · 384, n_i · 384) that minimizes the padding area. The resized image is then divided into m_i × n_i local tiles of 384 × 384 pixels, plus one global thumbnail tile. The SigLIP-SO400M-384 vision encoder processes all (1 + m_i × n_i) tiles, yielding 27 × 27 = 729...
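The selection rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cap on the number of tiles (`max_tiles`), the convention that m counts tiles along the height and n along the width, and the tie-break toward fewer tiles are all assumptions. Exact fractions are used so that equal padding areas compare as equal.

```python
from fractions import Fraction

TILE = 384                 # tile side in pixels, from the text
EMBEDS_PER_TILE = 27 * 27  # 729 embeddings per tile, from the text


def candidate_grids(max_tiles):
    """All (m, n) grids with m * n <= max_tiles. The cap is an assumption."""
    return [(m, n) for m in range(1, max_tiles + 1)
            for n in range(1, max_tiles + 1) if m * n <= max_tiles]


def padding_area(width, height, m, n):
    """Padding left after an aspect-preserving resize into a canvas that is
    n tiles wide and m tiles high (orientation assumed)."""
    canvas_w, canvas_h = n * TILE, m * TILE
    scale = min(Fraction(canvas_w, width), Fraction(canvas_h, height))
    return canvas_w * canvas_h - width * height * scale * scale


def select_grid(width, height, max_tiles=9):
    """Pick the grid with the least padding; ties go to fewer tiles (assumed)."""
    return min(candidate_grids(max_tiles),
               key=lambda g: (padding_area(width, height, *g), g[0] * g[1]))


def encoder_embeddings(m, n):
    """Embeddings produced for m * n local tiles plus one global thumbnail."""
    return (1 + m * n) * EMBEDS_PER_TILE
```

For a 768 × 384 image, `select_grid(768, 384)` picks one row of two tiles, and `encoder_embeddings(1, 2)` counts three tiles' worth of embeddings (two local tiles plus the thumbnail).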

Original: ...× (n_i · 14 + 1) visual tokens, which are subsequently projected into the language model’s embedding space using a two-layer multilayer perceptron (MLP). A visual illustration of our dynamic tiling strategy is shown in Figure 3. DeepSeekMoE LLM. Our language model is based on DeepSeekMoE [20, 86], which incorporates the Multi-head Latent Attention mechanism [53]. MLA enhances inference efficiency by compressing the Key-Value cache into a latent vector, enabling increased throughput capacity. The model also incorporates a MoE architecture [20] allowing for efficient inference thr...
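The excerpts mention sparse MoE computation and a per-expert global bias term used for load balancing, but do not spell out the mechanics. The sketch below shows one plausible reading, stated as an assumption: the bias is added to the routing scores only when selecting the top-k experts, the gate weights use the unbiased scores, and the bias is nudged by a fixed `step` against each expert's load. The function names and the update rule are hypothetical.

```python
def route_top_k(scores, bias, k):
    """Select k experts by biased score, but weight their outputs with the
    unbiased scores so the bias only steers selection (assumed design)."""
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


def update_bias(bias, loads, step):
    """Lower the bias of experts loaded above the mean and raise it for
    experts loaded below the mean (assumed update rule)."""
    mean_load = sum(loads) / len(loads)
    new_bias = []
    for b, load in zip(bias, loads):
        if load > mean_load:
            new_bias.append(b - step)
        elif load < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


scores = [0.5, 0.3, 0.2]
no_bias = route_top_k(scores, [0.0, 0.0, 0.0], k=2)     # picks experts 0 and 1
with_bias = route_top_k(scores, [0.0, 0.0, 0.25], k=2)  # picks experts 0 and 2
```

Because only k experts run per token, compute stays sparse, and a bias-based correction of this kind avoids adding an extra balancing term to the training loss.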

3 Data Construction

[Data/Training] Training data construction and training pipeline of DeepSeek-VL2.

Original: We build a comprehensive Vision-Language dataset from diverse sources for DeepSeek-VL2. The training process is structured into three distinct stages: (1) VL alignment, (2) VL pretraining, and (3) supervised fine-tuning (SFT). In the following parts, we provide descriptions of the data used in each stage. 3.1 Vision-Language Alignment Data. The alignment stage focuses on training the MLP connector to bridge the pretrained visual encoder and the LLM. For this initial warmup phase, we utilize ShareGPT4V [12], a dataset containing approximately 1.2M caption and conversation samples. 3.2 Vision-...

Original: ...problematic cases with brief descriptions, mismatched text pairs, or obvious hallucinations. To address these quality inconsistencies, we developed a comprehensive image captioning pipeline that considers: (1) OCR hints, (2) meta information (e.g., location, camera settings), and (3) relevant original captions as prompts. Using an in-house captioner, we recaption the images following prompting strategies similar to PixelProse [78], employing varied instructions to guide the VLM’s caption generation. Despite the overall improvement in caption quality, we observed repetition issues in the larg...

Original: ...Websight using DeepSeek V2.5. We also exploit Python plot codes generated by DeepSeek V2.5 to mitigate the noises in the plot-to-code data. • QA with visual prompt. We follow [9] to construct visual prompt understanding data by overlaying various visual indicators (arrows, boxes, circles, and scribbles) onto images from [9, 89, 90]. We then created QA pairs focusing on objects highlighted by these visual prompts. Visual grounding data. We construct our visual grounding dataset from [71, 75]. For each image’s object detection annotations, we structure the data as follows: • Prompt: L...

Original: ...high-quality in-house QA pairs. Below, we detail our efforts to enhance the quality of our SFT dataset. General visual question-answering. While public visual QA datasets are diverse [74, 10, 43, 9, 27, 31, 47], they often suffer from three main limitations: (1) short responses, (2) poor OCR quality, and (3) hallucinated content. To address these issues, we regenerate responses by jointly considering the original questions, images, and OCR information. Our experiments demonstrate that this approach produces more comprehensive and accurate results. During development, we observed that an e...

Original: ...enhanced table-based QA data by regenerating responses for all public datasets [14, 49] based on their original questions except Cauldron [43], which already exhibits high quality. Similar to our OCR capabilities developed during VL pretraining, our model demonstrated strong performance in chart understanding without requiring additional efforts. Reasoning, logic, and mathematics. We enhance public reasoning-focused datasets [76, 43, 61, 17, 102, 109] with more detailed reasoning processes and standardize response formats which puts the final answer at the end of the response. We obse...

Original: ...sents phrases like “an object within the red bounding box” while is the model’s description of the detected object (e.g., “cat”). Grounded conversation. We construct our grounded conversation data using [62, 72] to further enhance the model’s capabilities established during the pretraining phase. Text-Only datasets. To maintain the language ability of the model, we also use text-only instruction-tuning datasets [98, 4, 18, 68, 91, 70, 84, 6, 19] during the SFT stage.

3.1 Vision-Language Alignment Data

Original: The alignment stage focuses on training the MLP connector to bridge the pretrained visual encoder and the LLM. For this initial warmup phase, we utilize ShareGPT4V [12], a dataset containing approximately 1.2M caption and conversation samples.

3.2 Vision-Language Pretraining Data

Original: Following DeepSeek-VL [59], our pretraining data combines vision-language (VL) and text-only data to maintain a balance between VL capabilities and text-only performance. For DeepSeek-VL2, we maintain a ratio of around 70% VL data to 30% text-only data, with the latter sourced directly from our base LLM pretraining corpus. In the following, we categorize the VL data into several groups and describe their details. Interleaved image-text data. Our data collection begins with several open-sourced datasets, including WIT [79], WikiHow [38], and 30% random samples from OBELICS [41]. Thi...
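The roughly 70% to 30% mix of VL and text-only data can be illustrated with a simple weighted sampler. This is only a sketch of the ratio itself: the function name, the source labels, and the idea of sampling per example are assumptions, and the actual pipeline may mix data at a different granularity.

```python
import random


def sample_sources(n, vl_ratio=0.7, seed=0):
    """Draw n source labels: 'vl' with probability vl_ratio, else 'text'."""
    rng = random.Random(seed)
    return rng.choices(["vl", "text"], weights=[vl_ratio, 1 - vl_ratio], k=n)


batch = sample_sources(10_000)
vl_fraction = batch.count("vl") / len(batch)  # close to 0.7 for large n
```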

Original: ...es. To mitigate this, we implemented a quality control pipeline using DeepSeek Chat [53] to score all captions simply based on their writing quality. In practice, this approach is both efficient and effective in filtering out low-quality captions. Optical character recognition data. To develop OCR capabilities, we used open-source datasets including LaTeX OCR [7] and 12M RenderedText [93]. We combined these datasets with an extensive in-house OCR dataset covering diverse document types. Currently, our in-house dataset mainly focuses on English and Chinese character recognition. We plan ...
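The caption quality-control step described above amounts to scoring each caption and dropping those below a cutoff. The sketch below shows that shape with a toy scorer standing in for the language-model judge; `toy_quality_score`, the threshold value, and the word-diversity heuristic are all hypothetical and are not the scoring used in the paper.

```python
def toy_quality_score(caption):
    """Stand-in scorer: fraction of distinct words, so heavily repeated
    captions score low. A real pipeline would ask a language model instead."""
    words = caption.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def filter_captions(captions, score_fn, threshold):
    """Keep only captions whose score reaches the threshold."""
    return [c for c in captions if score_fn(c) >= threshold]


captions = ["a cat sits on a mat", "nice nice nice nice", ""]
kept = filter_captions(captions, toy_quality_score, threshold=0.6)
```

Passing the scorer in as `score_fn` keeps the filter independent of how quality is judged, so a model-based scorer could be swapped in without changing the filtering code.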

Original: ...<|/ref|> in the given image. • Response: <|ref|><query><|/ref|><|det|>[[x1, y1, x2, y2], …]<|/det|> During training, the question prompts are randomly sampled from a candidate pool. <|ref|>, <|/ref|>, <|det|>, <|/det|> are special tokens. <query> is a placeholder for either the category name (e.g., “car”) or description of the object (e.g., “the leftmost person”). [[x1, y1, x2, y2], …] is a list of bounding boxes, where each bounding box corresponds to an object’s position. The coordinates x1, y1 and x2, y2 specify the top-left and bottom-right corners respectively, normalized to ...
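As a rough illustration of the response format above, the following sketch builds a grounding response string from pixel-space boxes. The special tokens come from the text; the function names are hypothetical, and `grid` (the upper end of the normalized coordinate range) is left as a required parameter because the excerpt is cut off before stating the range.

```python
def normalize_box(box, width, height, grid):
    """Scale pixel corners (x1, y1, x2, y2) onto a 0..grid integer range.
    grid is a hypothetical parameter; the excerpt ends before the range."""
    x1, y1, x2, y2 = box
    return [round(x1 / width * grid), round(y1 / height * grid),
            round(x2 / width * grid), round(y2 / height * grid)]


def grounding_response(query, boxes, width, height, grid):
    """Build '<|ref|>query<|/ref|><|det|>[[x1, y1, x2, y2], ...]<|/det|>'."""
    normalized = [normalize_box(b, width, height, grid) for b in boxes]
    return f"<|ref|>{query}<|/ref|><|det|>{normalized}<|/det|>"


# Illustrative call with an assumed grid of 1000.
example = grounding_response("car", [(0, 0, 50, 100)], 100, 200, grid=1000)
```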

3.3 Supervised Fine-tuning Data

Original: Our SFT data combines a diverse collection of open-sourced datasets with high-quality in-house QA pairs. Below, we detail our efforts to enhance the quality of our SFT dataset. General visual question-answering. While public visual QA datasets are diverse [74, 10, 43, 9, 27, 31, 47], they often suffer from three main limitations: (1) short responses, (2) poor OCR quality, and (3) hallucinated content. To address these issues, we regenerate responses by jointly considering the original questions, images, and OCR information. Our experiments demonstrate that this approach produces more ...

Original: ...improves document-based interactions. Table and chart understanding. We enhanced table-based QA data by regenerating responses for all public datasets [14, 49] based on their original questions except Cauldron [43], which already exhibits high quality. Similar to our OCR capabilities developed during VL pretraining, our model demonstrated strong performance in chart understanding without requiring additional efforts. Reasoning, logic, and mathematics. We enhance public reasoning-focused datasets [76, 43, 61, 17, 102, 109] with more detailed reasoning processes and standardize resp...

Original: ...ref|>, <|det|>, <|/det|> are special tokens. The

4 Training Methodology

4.1 Training Pipelines

4.2 Hyperparameters and Infrastructures

5 Evaluation

5.1 Multimodal Performance

5.2 Qualitative Study

6 Conclusion