DeepSeek-VL: Towards Real-World Vision-Language Understanding


arXiv: 2403.05525 | 2024-03-08


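Summary

DeepSeek-VL is an open-source vision-language model aimed at real-world vision-language understanding, including document understanding, image understanding, and OCR. It uses a hybrid vision encoder whose features are concatenated and passed through a vision-language adaptor, which lets it process high-resolution images at low computational cost. The chat model supports multi-turn visual dialogue.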

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan (DeepSeek-AI)

Abstract: We present DeepSeek-VL, an open-source vision-language (VL) model designed for real-world vision and language understanding. DeepSeek-VL achieves state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We release both the 1.3B and 7B models to foster innovation based on this foundation model.

Original: Haoyu Lu*†, Wen Liu*, Bo Zhang*‡, Bingxuan Wang†, Kai Dong, Bo Liu†, Jingxiang Sun†, Tongzheng Ren†, Zhuoshu Li, Hao Yang†, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan. DeepSeek-AI. {neal, liuwen, bo}@deepseek.com. https://github.com/deepseek-ai/DeepSeek-VL. Abstract: We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions: • Data Construction: We strive to ensure our data is diverse, scalable and extensively covers real-world s...

As a vision-language chatbot, the model delivers a superior user experience in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We have made both the 1.3B and 7B models publicly accessible to foster innovation based on this foundation model.

Original: uperior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We have made both 1.3B and 7B models publicly accessible to foster innovations based on this foundation model. 1 Introduction: The remarkable success of large language models (LLMs) (OpenAI, 2022, 2023a; Anthropic, 2023; Google, 2023) has fueled the demand for a versatile interface that can handle multiple modalities ...

...scenarios, primarily for the following reasons. Many open-source solutions allocate a significant proportion of their computational resources to the instruction-tuning phase. However, experience from training powerful language models underscores the importance of extensive pretraining for developing general intelligence. To imbue multimodal models with rich world knowledge, the pretraining phase deserves emphasis.

Original: scenarios, primarily due to the following reasons: • Many open-source solutions allocate a significant proportion of computational resources to the instruction tuning phase. However, the experience of training powerful language models underscores the importance of extensive pretraining in the development of general intelligence. To imbue multimodal models with rich world knowledge, there should be an emphasis on comprehensive pretraining that leverages a broad spectrum of vision-language data. • A common practice is to amalgamate various academic datasets during instruction tuning. While such ...

...a training strategy that balances the modalities. On top of this, we develop a training methodology that steers model scaling from 1B to 7B. These comprehensive explorations bring a significant performance advantage in practical settings over other large multimodal models (LMMs) of similar size. DeepSeek-VL's pretraining dataset is compiled from a variety of sources.

Original: ng strategy that balances the multi-modalities. On top of these, we develop a training methodology that steers the model scaling, from 1B to 7B. These comprehensive explorations bring a significant performance advantage in practical settings, compared to other large multimodal models (LMMs) of similar size. DeepSeek-VL's pretraining dataset is compiled from a variety of sources, including but not limited to Common Crawl, Web Code, E-books, Educational Materials, and arXiv Articles. This collection thoroughly encompasses real-world scenarios such as web screenshots, PDFs, OCR, charts, and knowl...

...making it feasible for both text-image interleaving and multi-turn inference scenarios. A common challenge during multimodal pretraining is the potential degradation of language capabilities when training relies too heavily on vision-language data. Our research shows that maintaining a significant proportion of language data, specifically at least 70%, is essential to preserve language knowledge within the model.

Original: making it feasible for both text-image interleaving and multi-turn inference scenarios. During the pretraining of multimodal models, a common challenge encountered is the potential degradation of language capabilities when the training process is overly reliant on vision-language data. Our research reveals that maintaining a significant proportion of language data, specifically at least 70%, is essential to preserve the integrity of language knowledge within the model. This balance is critical for achieving a robust multimodal capability that does not compromise language performance. Moreover, ...

...and to enable a wide range of applications, we have made two versions, 1.3B and 7B, publicly accessible, in the hope of serving varying computational budgets. 2 Data Construction: A diverse and large dataset is the most important ingredient of visual language model training. Our dataset can be divided into two parts: vision-language pretraining data and vision-language supervised fine-tuning data.

Original: and enable a wide range of applications, we have made two versions of our model, 1.3B and 7B, publicly accessible, in the hope of facilitating the needs of varying computational capabilities. 2 Data Construction: A diverse and large dataset is the most important ingredient of visual language model training. Our dataset can be divided into two parts: Vision-Language pretraining Data and Vision-Language Supervised Fine-Tuning Data. VL pretraining Data is composed of visual-text data from various sources, aimed at enhancing the model's fundamental cross-modal understanding capabilities; while VL Su...

1 Introduction

The remarkable success of large language models (LLMs) has fueled demand for a versatile interface that can handle modalities beyond language. In response to this growing demand, large multimodal models (LMMs) such as GPT-4V and Gemini have emerged, serving as versatile assistants.

Original: The remarkable success of large language models (LLMs) (OpenAI, 2022, 2023a; Anthropic, 2023; Google, 2023) has fueled the demand for a versatile interface that can handle multiple modalities beyond language. In response to this growing demand, we have seen an emergence of Large Multimodal Models (LMMs) like GPT-4V (OpenAI, 2023b) and Gemini (Team et al., 2023), which serve as versatile assistants capable of comprehending and acting upon instructions that span vision and language. These models exhibit considerable promise in executing complex, diverse real-world tasks, enabling more nat...

...an emphasis on comprehensive pretraining that leverages a broad spectrum of vision-language data. A common practice is to amalgamate various academic datasets during instruction tuning. While this may yield good benchmark results, it often falls short of an authentic real-world usage experience. In terms of model architecture, prior works mostly adapt a vision transformer to a pre-trained language model.

Original: emphasis on comprehensive pretraining that leverages a broad spectrum of vision-language data. • A common practice is to amalgamate various academic datasets during instruction tuning. While such an approach may yield good benchmark results, it often falls short in providing an authentic real-world usage experience. • In terms of model architecture, prior works mostly adapt a vision transformer, typically text-aligned, to a pre-trained language model. However, most of these models operate on a relatively low resolution, e.g., 336 × 336 or 448 × 448. The intricacies of complex rea...

...including, but not limited to, Common Crawl, Web Code, E-books, Educational Materials, and arXiv Articles. This collection thoroughly covers real-world scenarios such as web screenshots, PDFs, OCR, charts, and knowledge-based content (expertise, textbooks), aiming for a broad and practical representation while remaining scalable.

Original: ted to Common Crawl, Web Code, E-books, Educational Materials, and arXiv Articles. This collection thoroughly encompasses real-world scenarios such as web screenshots, PDFs, OCR, charts, and knowledge-based content (expertise, textbooks), aiming for a broad and practical representation while remaining scalable. While our pretraining data encompasses a wide array of world knowledge, we meticulously curate our instruction-tuning dataset to reflect real-world usage scenarios. To achieve this, we manually gather authentic test cases for GPT-4V and Gemini from the Internet. These cases have been sy...

...is essential to preserve the integrity of language knowledge within the model. This balance is critical for achieving robust multimodal capability that does not compromise language performance. Moreover, we introduce a novel "modality warm-up" strategy, which carefully adjusts the ratio of modalities during training, gradually incorporating more vision-language data.

Original: ial to preserve the integrity of language knowledge within the model. This balance is critical for achieving a robust multimodal capability that does not compromise language performance. Moreover, we introduce a novel "modality warm-up" strategy. This approach carefully adjusts the ratio of modalities during training, gradually incorporating more vision-language data. The careful tuning of the modality ratio along with the warm-up strategy results in a balanced performance of both modalities. When iterating on our model, we conduct experiments on a small scale before scaling to a larger model ...
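The modality warm-up idea can be illustrated with a small scheduling sketch. The text only states that the share of vision-language data is increased gradually and that language data should stay at 70% or more; the linear ramp, the starting point of zero, and the names `multimodal_fraction` and `batch_composition` are assumptions made for illustration, not the authors' schedule.

```python
def multimodal_fraction(step: int, warmup_steps: int, target: float = 0.3) -> float:
    """Share of vision-language samples at a given step.

    Assumption: a linear ramp from 0 to `target`; the source does not
    specify the shape of the warm-up schedule.
    """
    if warmup_steps <= 0:
        return target
    progress = min(max(step, 0), warmup_steps) / warmup_steps
    return target * progress


def batch_composition(step: int, batch_size: int, warmup_steps: int,
                      target: float = 0.3) -> dict:
    """Split one batch into multimodal and language-only sample counts."""
    n_multimodal = round(batch_size * multimodal_fraction(step, warmup_steps, target))
    return {"multimodal": n_multimodal, "language": batch_size - n_multimodal}


if __name__ == "__main__":
    for step in (0, 500, 1000, 5000):
        print(step, batch_composition(step, batch_size=1000, warmup_steps=1000))
```

With a target of 0.3, the language share never drops below 70% of each batch, which matches the proportion the text describes as essential.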

2 Data Construction

A diverse and large dataset is the most important ingredient of visual language model training. Our dataset can be divided into two parts: vision-language pretraining data and vision-language supervised fine-tuning data. The VL pretraining data is composed of visual-text data from various sources and aims to strengthen the model's fundamental cross-modal understanding.

Original: A diverse and large dataset is the most important ingredient of visual language model training. Our dataset can be divided into two parts: Vision-Language pretraining Data and Vision-Language Supervised Fine-Tuning Data. VL pretraining Data is composed of visual-text data from various sources, aimed at enhancing the model's fundamental cross-modal understanding capabilities; while VL Supervised Fine-Tuning Data has a relatively smaller size and aims to teach the model to complete specific downstream tasks. By design, VL pretraining Data is used to warm up the vision-language adaptor in trainin...

...O (Krylov et al., 2021), HierText (Long et al., 2022). Document OCR: arXiv rendered markdown (Blecher et al., 2023), 2.1%. Text-only corpus: DeepSeek-LLM 2T text corpus (DeepSeek-AI, 2024), 70.0%. The pretraining dataset used in our study covers a diverse range of publicly accessible sources, in addition to a selection of proprietary data.

Original: O (Krylov et al., 2021) HierText (Long et al., 2022) Document OCR arXiv rendered markdown (Blecher et al., 2023) 2.1% Text-only corpus DeepSeek-LLM 2T text corpus (DeepSeek-AI, 2024) 70.0% The pretraining dataset utilized in our study encompasses a diverse range of publicly accessible sources, in addition to a selection of proprietary data. We provide a comprehensive overview of the data sources employed during the joint vision and language pretraining stage in Table 1. Such a dataset can facilitate LLM's comprehension of the entities portrayed in the images. Furthermore, we presen...

...inverse rendering. This involved processing approximately 1.46 million Jupyter notebooks from the Stack dataset. By extracting these notebooks and collating all diagrams with their corresponding preceding code segments, we curated a collection of 2 million image-code pairs. For better data quality, we filter this down to 1.1 million instances, each comprising a single image paired with at least 5 lines of code, to form our primary training dataset.

Original: al plots inverse rendering. This involved the processing of approximately 1.46 million Jupyter notebooks from the Stack dataset (Kocetkov et al., 2023). By extracting these notebooks and collating all diagrams along with their corresponding preceding code segments, we succeeded in curating a collection featuring 2 million pairs of images and codes. For better data quality, we filter 1.1 million instances, each comprising a singular image coupled with a minimum of 5 lines of code, to constitute our primary training dataset. Document Optical Character Recognition (OCR) data facilitates the rec...
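The quality filter described above (one image paired with at least 5 lines of code) could look roughly like the following. This is a minimal sketch: the `PlotCodePair` record, its field names, and the decision to count only non-blank lines are assumptions for illustration; the source does not describe how lines are counted.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PlotCodePair:
    """Hypothetical record: one rendered plot and the code that precedes it."""
    image_path: str
    code: str


def count_code_lines(code: str) -> int:
    # Assumption: blank lines do not count toward the 5-line minimum.
    return sum(1 for line in code.splitlines() if line.strip())


def filter_pairs(pairs: Iterable[PlotCodePair], min_lines: int = 5) -> List[PlotCodePair]:
    """Keep pairs whose code segment has at least `min_lines` non-blank lines."""
    return [p for p in pairs if count_code_lines(p.code) >= min_lines]


if __name__ == "__main__":
    short = PlotCodePair("a.png", "import matplotlib.pyplot as plt\nplt.plot([1, 2])")
    long = PlotCodePair(
        "b.png",
        "import matplotlib.pyplot as plt\nx = [1, 2, 3]\ny = [2, 4, 6]\n"
        "plt.plot(x, y)\nplt.title('demo')\nplt.show()",
    )
    print([p.image_path for p in filter_pairs([short, long])])
```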

...al., 2017), ReCTS (Zhang et al., 2019), TextOCR (Singh et al., 2021), OpenVINO (Krylov et al., 2021) and HierText (Long et al., 2022). The text-only corpus serves to maintain proficiency in language-centric tasks. In this study, we use the same text corpus as DeepSeek-LLM.

Original: t al., 2017), ReCTS (Zhang et al., 2019), TextOCR (Singh et al., 2021), OpenVINO (Krylov et al., 2021) and HierText (Long et al., 2022). Text-only corpus serves to maintain proficiency in language-centric tasks. In this study, we employ the same text corpus with DeepSeek-LLM (DeepSeek-AI, 2024). Table 2: Summary of data used in our joint vision and language supervised fine-tuning stage. Class Dataset Ratio In-house Data SFT data based on taxonomy (Figure 3) 10.5% General Multi-modality ShareGPT4V (Chen et al., 2023) 35.5% LAION-GPTV (LAION, 2023) LVIS-Instruct4V (Wang et al....

...form. Image to Code: UI to Code, Chart to Code, Photo to SVG/p64 Encoding, Formula to Code, Flowchart to Code. Image to Text: Image to Prompt, Text Summary, Image-based Creation, Text Interpretation. Analysis: This type of use case requires the model to use specific knowledge and logical ability to make a reasonable analysis and understanding based on image content, and to describe the image according to instructions.

Original: form. Image to Code: UI to Code, Chart to Code, Photo to SVG/p64 Encoding, Formula to Code, Flowchart to Code. Image to Text: Image to Prompt, Text Summary, Image-based Creation, Text Interpretation. Analysis: This type of use case requires the model to use specific knowledge and logical ability to make reasonable analysis and understanding based on image content, and describe the image according to instructions. Data Chart Analysis: Graph Interpretation, Table Interpretation. Professional Chart Analysis: Circuit Diagram, Flowchart, Map, Music Score, Financial Chart, Floor Plan, Others. Professional Im...

...Chemistry, Biology, Code, IQ Questions. Evaluation: This type of use case requires the model to evaluate image content according to specific criteria: Reality Evaluation, Similarity Evaluation, Aesthetic Evaluation, Open-ended Evaluation, Improvement Suggestions. Multi-graph: This type of use case examines the model's ability to analyze and understand multiple images. Temporal Sequence Understanding.

Original: istry, Biology, Code, IQ Questions. Evaluation: This type of use case requires the model to evaluate the image content according to specific criteria. Reality Evaluation, Similarity Evaluation, Aesthetic Evaluation, Open-ended Evaluation, Improvement Suggestions. Multi-graph: This type of use case examines the model's ability to analyze and understand multiple images. Temporal Sequence Understanding: Event Prediction, Image Sequencing, Behavior Analysis. Multi-graph Comparison: Attribute Comparison, Image-Text Matching, Finding Associations, Spotting Differences, Image Discrimination. Safety: This ty...

...authentic test cases for GPT-4V and Gemini from various online sources. These test cases are carefully analyzed and organized into a comprehensive taxonomy that covers multiple categories, such as recognition, conversion, analysis, reasoning, evaluation, and safety, as detailed in Table 3. This structured taxonomy serves as a guideline for selecting representative prompts for each test image.

Original: a diverse set of authentic test cases for GPT-4V and Gemini from various online sources. These test cases are then carefully analyzed and organized into a comprehensive taxonomy, which encompasses multiple categories, such as recognition, conversion, analysis, reasoning, evaluation, and safety, as detailed in Table 3. This structured taxonomy serves as a guideline for selecting representative prompts for each test image, ensuring that our instruction-tuning dataset is both practical and relevant to real-world applications. Moreover, this taxonomy is also employed to construct a balanced and c...

2.1 Vision-Language pretraining Data

Table 1: Summary of datasets used in the joint vision and language pretraining stage. Interleaved image-text (13.1%): MMC4 (Zhu et al., 2024), Wikipedia EN & CN, Wikihow (Yang et al., 2021), in-house PDF and Epub textbooks. Image caption (11.1%): Capsfusion (Yu et al., 2023a) and others.

Original: Table 1: Summary of datasets used in the joint vision and language pretraining stage.
Category | Dataset | Ratio
Interleaved image-text | MMC4 (Zhu et al., 2024); Wikipedia EN & CN (Foundation); Wikihow (Yang et al., 2021); in-house PDF and Epub textbooks | 13.1%
Image caption | Capsfusion (Yu et al., 2023a); TaiSu (Liu et al., 2022b); Detailed Caption (echo840, 2024) | 11.1%
Table and chart | Chart2text (Kantharaj et al., 2022); Geo170K (Gao et al., 2023); Ureader (Ye et al., 2023); Unichart (Masry et al., 2023); M-paper (Hu et al., 2023); ScienceQA (Lu et al., 2022b); ScreenQ... | 2.1%

...we use three public datasets, MMC4 (Zhu et al., 2024), Wiki (Burns et al., 2023) and Wikihow (Yang et al., 2021), together with Epub textbooks. Image caption data come from three high-quality image-text paired datasets: Capsfusion (Yu et al., 2023a), TaiSu (Liu et al., 2022b) and Detailed Caption. Table and chart data enable the model to learn general table and chart image understanding.

Original: and we utilize three public datasets MMC4 (Zhu et al., 2024), Wiki (Burns et al., 2023), Wikihow (Yang et al., 2021) and Epub textbooks. Image caption data come from three high-quality image-text paired datasets: Capsfusion (Yu et al., 2023a), TaiSu (Liu et al., 2022b) and Detailed Caption (echo840, 2024). Table and chart data enable the models to learn the capability for general table and chart image understanding. It encompasses a diverse range of public data sources, including Chart2text (Kantharaj et al., 2022), Geo170K (Gao et al., 2023), Unichart (Masry et al., 2023), Ure...

...these documents. Although the small-scale dataset Latex-OCR (Blecher, 2024) is publicly accessible, we additionally constructed a comprehensive English and Chinese document OCR dataset. It comprises two parts. The first is arXiv articles: we collected source code and compiled PDFs from 1.4 million arXiv articles and rendered them into paired images and texts with the pre-processing tools from Nougat.

Original: ese documents. Despite the existence of the publicly accessible small-scale dataset Latex-OCR (Blecher, 2024), we additionally constructed a comprehensive English and Chinese document OCR dataset. It is comprised of two parts: 1) arXiv Articles: We collected source code and compiled PDFs from 1.4 million arXiv articles. Utilizing pre-processing tools from Nougat (Blecher et al., 2023), we rendered these articles into paired images and texts; 2) E-books and Educational Materials: We cleaned 860K English and 180K Chinese e-books from Anna's Archive (Anna's Archive, 2024) alongside million...

Web Code (2.0%): Screen-to-code (Abi, 2024), ScreenQA (Hsiao et al., 2022). Text-only SFT (47.9%): DeepSeek-LLM (DeepSeek-AI, 2024). Main category: Recognition. These use cases mainly examine a large model's ability to understand and describe image content; they do not require extensive knowledge or strong reasoning ability.

Original: Web Code Screen-to-code (Abi, 2024) 2.0% ScreenQA (Hsiao et al., 2022) Text-only SFT DeepSeek-LLM (DeepSeek-AI, 2024) 47.9% Main Category | Description | Secondary Category | Tertiary Category. Recognition: This part of the use cases mainly examines the understanding and description ability of large models for image content, which does not require high knowledge reserve and reasoning ability of the model, and some tasks can be completed using traditional machine learning models. Global Description: Theme Description, Event/Behavior Description, Location/Scene Description, Emotion/Mood Descript...

...Related Knowledge, Entertainment Related Knowledge, Historical Knowledge. Commonsense Reasoning: This type of use case mainly tests the model's understanding and mastery of everyday common sense; it requires reasoning that combines interpretation and analysis of image content with common sense. Relationship Reasoning: Interpersonal Relationship, Spatial Relationship, Size Relationship, Species Relationship.

Original: elated Knowledge, Entertainment Related Knowledge, Historical Knowledge. Commonsense Reasoning: This type of use case mainly tests the model's understanding and mastery of common sense in life, which requires reasoning based on the interpretation and analysis of image content combined with common sense. Relationship Reasoning: Interpersonal Relationship, Spatial Relationship, Size Relationship, Species Relationship. Function Reasoning: Hardware Function Reasoning, Software Function Reasoning. Environment Reasoning: Environment State Analysis, Environment-based Behavior Reasoning, Embodied Intelligenc...

...the high-quality in-house multi-modality SFT data are comprehensively represented in this taxonomy.

Original: h-quality in-house multi-modality SFT data are comprehensively represented in this taxonomy.

2.2 Supervised Fine-tuning Data

The supervised fine-tuning datasets used in our study cover a diverse range of multi-modality and language data sources, including well-known open-source shared GPT-4V datasets such as ShareGPT4V (Chen et al., 2023), LAION-GPTV (LAION, 2023), LVIS-Instruct4V (Wang et al., 2023a), textOCR-GPT4V (Carter, 2024), LLaVA1.6-GPT4V (Liu et al., 2024a) and IconQA (Lu et al., 2021).

Original: The supervised fine-tuning datasets utilized in our study encompass a diverse range of multi-modality and language data sources, including well-known open-source shared gpt4v datasets such as ShareGPT4V (Chen et al., 2023), LAION-GPTV (LAION, 2023), LVIS-Instruct4V (Wang et al., 2023a), textOCR-GPT4V (Carter, 2024), LLaVA1.6-GPT4V (Liu et al., 2024a) and IconQA (Lu et al., 2021). Additionally, we incorporate partial table and chart data extracted from pretraining datasets such as Ureader (Ye et al., 2023), ScreenQA (Hsiao et al., 2022), Geo170K (Gao et al., 2023), and ScienceQ...

In addition, we include the text-only SFT data used in DeepSeek-LLM (DeepSeek-AI, 2024) as part of our joint vision and language SFT data.

Original: e, we include the text-only SFT data employed in DeepSeek-LLM (DeepSeek-AI, 2024) as part of our joint vision and language SFT data.

3 Approach

3.1 Architecture: Our system contains three modules: a hybrid vision encoder, a vision adaptor, and a language model. We introduce each part in this section. Hybrid Vision Encoder: We employ SigLIP as the vision encoder to extract high-level semantic feature representations from visual inputs. However, we observe that a single SigLIP encoder struggles to address all real-world questions comprehensively.

Original: 3.1 Architecture: Our system contains three modules: a hybrid vision encoder, a vision adaptor, and a language model. We introduce each part in this section. Hybrid Vision Encoder. We employ SigLIP as the vision encoder to extract high-level semantic feature representations from visual inputs. However, we observe that a single SigLIP encoder struggles to address all real-world questions comprehensively. Vision encoders in the CLIP family, including SigLIP, are primarily designed for semantic visual representations but are challenged by ambiguous encoding, resulting in visually distinct images b...

...a feature map of size 64 × 64 × 256 generated by SAM-B, the VL adaptor first interpolates it to 96 × 96 × 256. It then applies two convolutional layers with a stride of 2, producing a 24 × 24 × 1024 feature map, and reshapes it to 576 × 1024. Alongside this, the low-resolution feature map of size 576 × 1024 generated by SigLIP-L is concatenated with the high-resolution features, resulting in 576 visual tokens with 2048 dimensions.

Original: e, 64 × 64 × 256 generated by SAM-B, the VL Adaptor initially interpolates it into a size of 96 × 96 × 256. Subsequently, it employs two convolutional layers with a stride of 2, producing a feature map of 24 × 24 × 1024, and reshapes it to 576 × 1024. Alongside this, the low-resolution feature map of size 576 × 1024 generated by SigLIP-L is concatenated with the high-resolution features, resulting in 576 visual tokens with 2048 dimensions. These visual tokens possess a substantial capacity for enhancing high-level semantic visual recognition and low-level visual grounding tasks. Then they unde...
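The shape bookkeeping in this passage can be checked with a few lines of arithmetic. The sketch below only traces tensor shapes; the 3 × 3 kernel, padding of 1, and the intermediate channel width of 512 are assumptions (the source gives only the stride and the input and output sizes), and `trace_hybrid_shapes` is a hypothetical helper name.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Standard output-size formula for a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_hybrid_shapes(interp_hw: int = 96, sam_channels: int = 256,
                        conv_channels=(512, 1024),
                        siglip_tokens: int = 576, siglip_dim: int = 1024):
    """Trace shapes from the interpolated SAM-B map to the fused visual tokens.

    Assumptions: kernel 3, padding 1, and a 512-channel intermediate layer.
    Only stride 2 and the 96x96x256 -> 24x24x1024 sizes come from the text.
    """
    h = w = interp_hw
    c = sam_channels
    stages = [("interpolated", (h, w, c))]
    for i, out_c in enumerate(conv_channels, start=1):
        h, w, c = conv_out(h), conv_out(w), out_c
        stages.append((f"conv{i}", (h, w, c)))
    high_res = (h * w, c)                      # reshape HxWxC -> (H*W)xC
    stages.append(("reshaped", high_res))
    if high_res[0] != siglip_tokens:
        raise ValueError("token counts must match before concatenation")
    fused = (siglip_tokens, high_res[1] + siglip_dim)  # concat on feature dim
    stages.append(("fused", fused))
    return stages


if __name__ == "__main__":
    for name, shape in trace_hybrid_shapes():
        print(f"{name:>12}: {shape}")
```

Under these assumptions the spatial size goes 96 to 48 to 24, giving 24 × 24 = 576 high-resolution tokens that line up one-to-one with the 576 SigLIP-L tokens before the feature dimensions are concatenated to 2048.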

...pretraining. Specifically, the DeepSeek-VL-1B model is built on the DeepSeek-LLM-1B model, which was trained on a corpus of approximately 500 billion text tokens, while the DeepSeek-VL-7B model is built on the DeepSeek-LLM-7B model, trained on an estimated 2 trillion text tokens. Figure 3: Our training pipeline consists of three stages.

Original: pretraining. Specifically, the DeepSeek-VL-1B model is constructed based on the DeepSeek-LLM-1B model, which underwent training with an approximate corpus of 500 billion text tokens. And the DeepSeek-VL-7B model is developed leveraging the DeepSeek-LLM-7B model trained with an estimated 2 trillion text tokens. Figure 3: Our training pipelines consist of three stages. Stage 1 involves training the Vision-Language (VL) adaptor while keeping the hybrid vision encoder and language model fixed. Stage 2 is the crucial part of the joint vision and language pretraining, where both VL adaptor and langu...

...smaller parameter capacity. This limitation in model capacity restricts what can be learned during this stage. A natural question arises: can the law of data scaling be effective at this stage? To address it, we ran a simple experiment, reported in Table 8. The results show that expanding the data scale at this stage provides no benefit and may even lead to inferior performance.

Original: tly smaller parameter capacity. This limitation in model capacity restricts the capabilities that can be learned during this stage. A natural question arises: Can the law of data scaling be effective at this stage? To address this question, we conducted a simple experiment in Table 8. The results demonstrate that expanding the data scale at this stage does not provide benefits and may even lead to inferior performance. Consequently, we proceed to unfreeze the Large Language Model (LLM) and investigate efficient vision-language pretraining approaches during stage 2. 3.2.2 Stage 2: Joint Vision...

...significant forgetting of language capabilities in LLMs. A suitable ratio (multimodal:language = 70%:30%) can effectively mitigate language forgetting while also enhancing the model's multimodal abilities. Joint Language-multimodal Training: To address this challenge, we devise a straightforward yet effective joint language-multimodal training strategy.

Original: significant forgetting of language capabilities in LLMs. A suitable ratio (multimodal:language=70%:30%) can effectively mitigate the issue of language forgetting while simultaneously enhancing the model's multimodal abilities. Joint Language-multimodal Training: To address this challenge, we devise a straightforward yet effective joint language-multimodal training strategy. During training, we not only engage in multimodal data training but also incorporate a large proportion of language data into the training. This approach aims to balance the training focus, mitigating the adverse effects obs...

...scaling it up to the 7B model. Fortunately, we observed that a significant portion of the outcomes obtained with the 1.3B models can be transferred effectively to the 7B model through SFT (e.g., the encoder design). However, during stage 2 training we encountered considerable fluctuations in the generative metrics of the 1.3B model, which made it hard to supervise the training process effectively.

Original: uently scaling it up to the 7B model. Fortunately, we have observed that a significant portion of the outcomes obtained from the 1.3B models can be effectively transferred to the 7B model through the utilization of SFT (e.g., the encoder design). However, during the stage 2 training phase, we have encountered considerable fluctuations in the generative metrics of the 1.3B model, rendering it challenging to supervise the training process effectively. And this has been discussed in Schaeffer et al. (2024), "sharp and unpredictable changes might be induced by the researcher's choice of measure...

...and engage in dialogue, culminating in the interactive DeepSeek-VL-Chat model. We optimize the language model, VL adaptor, and hybrid vision encoder with the vision-language SFT data shown in Table 2; SAM-B remains frozen because of limited GPU memory. We supervise only answers and special tokens, and mask the system and user prompts.

Original: d engage in dialogue, culminating in the creation of the interactive DeepSeek-VL-Chat model. We optimize the language model, VL adaptor, and hybrid vision encoder with the vision-language SFT data as shown in Table 2; SAM-B remains frozen due to the limited GPU memory. We only supervise answers and special tokens and mask the system and user prompts. To guarantee the model's comprehensive proficiency in dialogue, we utilize a blend of multimodal data and pure text dialogue data used in DeepSeek-LLM. This approach ensures the model's versatility across various dialogue scenarios. Figure 5: Vis...
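The supervision rule stated here (loss on answers and special tokens, none on system and user prompts) is commonly implemented by masking labels. The following is a minimal sketch of that idea, not the authors' code: the role tags, the `build_labels` helper, and the use of -100 as the ignore value (the default `ignore_index` of PyTorch's cross-entropy loss) are assumptions.

```python
from typing import List, Sequence

# Assumption: -100 marks positions excluded from the loss, matching the
# default ignore_index of PyTorch's CrossEntropyLoss.
IGNORE_INDEX = -100

# Assumption: each token carries a role tag; these names are hypothetical.
SUPERVISED_ROLES = {"assistant", "special"}


def build_labels(token_ids: Sequence[int], roles: Sequence[str]) -> List[int]:
    """Copy token ids as labels, masking system and user prompt tokens."""
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [
        tok if role in SUPERVISED_ROLES else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]


if __name__ == "__main__":
    ids = [1, 11, 12, 21, 22, 31, 32, 2]
    roles = ["special", "system", "system", "user", "user",
             "assistant", "assistant", "special"]
    print(build_labels(ids, roles))
```

The usual one-position shift between inputs and labels for next-token prediction is left out to keep the masking step itself visible.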

...ayanan et al., 2021; Korthikanti et al., 2023) and overlap computation and communication as in DeepSeek-LLM (DeepSeek-AI, 2024). DeepSeek-VL-7B took 5 days on a cluster of 64 nodes, each with 8 Nvidia A100 GPUs, while DeepSeek-VL-1B took 7 days on a 16-node setup.

Original: ayanan et al., 2021; Korthikanti et al., 2023) and overlap computation and communication as in DeepSeek-LLM (DeepSeek-AI, 2024). DeepSeek-VL-7B consumed 5 days on a cluster of 64 nodes, each comprising 8 Nvidia A100 GPUs, while DeepSeek-VL-1B consumed 7 days on a setup involving 16 nodes.
Model | DeepSeek-VL 1B | DeepSeek-VL-7B
Vision Encoder | SigLIP | SigLIP+SAM
Hyperparameters | Stage 1, Stage 2, Stage 3 | Stage 1, Stage 2, Stage 3
Learning rate | 1.0 × 10^{-3}, 3 × 10^{-5}, 2.0 × 10^{-5} | 1.0 × 10^{-3}, ...
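The training cost quoted above can be turned into GPU-days with simple arithmetic. This is a sketch under one explicit assumption: the text states 8 A100 GPUs per node only for the 64-node cluster, so the same node size is assumed for the 16-node setup.

```python
def gpu_days(nodes: int, gpus_per_node: int, days: float) -> float:
    """Total GPU-days consumed by a run."""
    return nodes * gpus_per_node * days


# Stated in the text: 64 nodes x 8 A100 GPUs for 5 days.
cost_7b = gpu_days(nodes=64, gpus_per_node=8, days=5)

# Assumption: the 16-node setup also has 8 GPUs per node (not stated).
cost_1b = gpu_days(nodes=16, gpus_per_node=8, days=7)

if __name__ == "__main__":
    print(f"DeepSeek-VL-7B: {cost_7b} GPU-days")
    print(f"DeepSeek-VL-1B: {cost_1b} GPU-days (assuming 8 GPUs per node)")
```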


3.2 Training Pipelines

We train DeepSeek-VL in three consecutive stages, as shown in Figure 3: vision-language adaptor warm-up, joint vision-language pretraining, and supervised fine-tuning. We currently focus on visual understanding capabilities and compute the next-token prediction loss only on the language part. 3.2.1 Stage 1: Training Vision-Language Adaptor.

Original: We train our DeepSeek-VL in three consecutive stages as shown in Figure 3: vision-language adaptor warmup, joint vision-language pretraining, and supervised fine-tuning. We currently focus on visual understanding capabilities and only calculate the next token prediction loss on the language part. 3.2.1 Stage 1: Training Vision-Language Adaptor. The primary objective of this stage is to establish a conceptual link between visual and linguistic elements within the embedding space, thereby facilitating the comprehensive understanding of depicted entities in the images by the Large Language Model ...

We keep the vision encoder frozen and optimize the language model and VL adaptor. Initially, we attempted to train the LLM directly on multimodal data. However, while the multimodal metrics improved incrementally, the language metrics declined sharply, as illustrated in Figure 4 (multimodal:language = 100%:0%). This underscores the inherent challenge of direct multimodal pretraining on top of an LLM.

Original: We keep the vision encoder frozen and optimize the language model and VL adaptor. Initially, we attempt to directly train the LLM with multimodal data. However, we find while the metrics for multimodal performance incrementally improved, there is a stark and severe decline in language metrics as illustrated in Figure 4 (Multimodal:Language=100%:0%). This underscores the inherent challenge in directly conducting multimodal pretraining on the foundation of an LLM, revealing a critical trade-off between enhancing multimodal abilities and preserving linguistic proficiency. We hypothesize that t...

...integrating language data significantly alleviates the decline in language capabilities, yielding a substantial improvement in the model's linguistic performance. (2) Including language data does not cause a significant loss in multimodal performance, indicating that the model retains its multimodal processing abilities. (3) The performance of each modality is strongly correlated with its proportion in the training dataset.

Original: egrating language data significantly alleviates the decline in language capabilities, demonstrating a substantial improvement in the model's linguistic performance. (2) The inclusion of language data does not lead to a significant loss in multimodal performance, indicating that the model retains its multimodal processing abilities. (3) The performance of different modalities is strongly correlated with their respective proportions in the training dataset, substantiating the competitive relationship between the two modalities. Ultimately, we opt for a training ratio of language to multimodal ...

...this issue: the limited capacity of the 1.3B model and the absence of SFT data in the training dataset, both of which hinder the model's ability to follow instructions accurately. Even when the model knows the correct option, it struggles to generate it precisely. To mitigate these challenges, we adopt a dual-pronged approach. First, we employ the Multi-choice PPL methodology to monitor the model's progress.

Original: s issue: the limited capacity of the 1.3B model and the absence of SFT data within the training dataset, both of which hinder the model's ability to accurately follow instructions. Even when the model possesses knowledge of the correct options, it struggles to generate them precisely. To mitigate these challenges, we adopt a dual-pronged approach. Firstly, we employ the Multi-choice PPL methodology to monitor the model's progress. This involves inputting not only the prompt and image into the network but also all the answers associated with the question. Subsequently, we calculate the PPL for ...

3.3 Hyperparameters and Infrastructures

The detailed hyperparameters of all stages are shown in Table 4. We train and evaluate DeepSeek-VL with HAI-LLM (High-flyer, 2023), a lightweight and efficient distributed training framework. Because we use visual encoders to convert images into embedding vectors and then treat image embeddings and text embeddings uniformly, we can easily adapt pipeline parallelism to VL model training.

Original: The detailed hyperparameters of all stages are illustrated in Table 4. We train and evaluate our DeepSeek-VL with HAI-LLM (High-flyer, 2023), a lightweight and efficient distributed training framework. Since we use visual encoders to convert images into embedding vectors and then treat image embeddings and text embeddings uniformly, we can easily adapt pipeline parallelism to VL model training: all we need to do is to view visual encoders and text embedding as a single module and take it as the first layer of the resulting model. This very first layer has a complicated model structure and p...

Both models use the same scheduler sequence across the three stages (Cosine, Step, Cosine), a weight decay of 0.0, gradient clipping at 1.0, the AdamW optimizer (β1 = 0.9, β2 = 0.95), and 128, 2000, and 256 warm-up steps for stages 1, 2, and 3.

Original (columns: DeepSeek-VL 1B Stage 1, Stage 2, Stage 3 | DeepSeek-VL-7B Stage 1, Stage 2, Stage 3):
duler | Cosine, Step, Cosine | Cosine, Step, Cosine
Weight decay | 0.0, 0.0, 0.0 | 0.0, 0.0, 0.0
Gradient clip | 1.0, 1.0, 1.0 | 1.0, 1.0, 1.0
Optimizer | AdamW(β1 = 0.9, β2 = 0.95) | AdamW(β1 = 0.9, β2 = 0.95)
Warm-up steps | 128, 2000, 256 | 128, 2000, 256
Training steps | 15000, 96000, 10000 | 15000, 42000, 10000
Batch size | 256, 1024, 256 | 256, 2304, 256
Sequence length | 512, 4096, 4096 | 512, 4096, 4096
Sequence packing | ×, ✓, × | ×, ✓, ×
Pipeline p...
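The table rows for training steps, batch size, and sequence length give a rough upper bound on tokens processed per stage. The sketch below multiplies the three; it assumes every sequence is filled to its full length, which is only plausible for the stages that use sequence packing, so the figures are ceilings rather than reported token counts. The `STAGES` layout and `max_tokens` helper are hypothetical names.

```python
# (training steps, batch size, sequence length) per stage, copied from the table.
STAGES = {
    "DeepSeek-VL 1B": [(15000, 256, 512), (96000, 1024, 4096), (10000, 256, 4096)],
    "DeepSeek-VL-7B": [(15000, 256, 512), (42000, 2304, 4096), (10000, 256, 4096)],
}


def max_tokens(steps: int, batch_size: int, seq_len: int) -> int:
    """Upper bound on tokens seen, assuming fully filled sequences."""
    return steps * batch_size * seq_len


if __name__ == "__main__":
    for model, stages in STAGES.items():
        for stage_no, cfg in enumerate(stages, start=1):
            print(f"{model} stage {stage_no}: at most {max_tokens(*cfg) / 1e9:.1f}B tokens")
```

Under this bound, stage 2 comes out at roughly 400B tokens for both models: the 7B run uses fewer steps but a larger batch size.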

4 Evaluation

4.1 Public Multimodal Benchmarks Evaluation. We evaluate our models on a series of public benchmarks. Multimodal comprehensive understanding datasets: MMMU (Yue et al., 2023), CMMMU (Zhang et al., 2024), MMBench (Liu et al., 2023a), MMBench-CN (Liu et al., 2023a), SeedBench (Li et al., 2023a), and MMV (Yu et al., 2023b).

Original: 4.1 Public Multimodal Benchmarks Evaluation. We evaluate our models on a series of public benchmarks: Multimodal comprehensive understanding datasets: MMMU (Yue et al., 2023), CMMMU (Zhang et al., 2024), MMBench (Liu et al., 2023a), MMBench-CN (Liu et al., 2023a), SeedBench (Li et al., 2023a) and MMV (Yu et al., 2023b). We compare DeepSeek-VL with competitors on MMB/MMC-dev as current official test download link is no longer active. Chart/table understanding datasets: OCRBench (Liu et al., 2023b);
[Table columns: LLM, MMMU, CMMMU, MMB, MMC, SEED, OCRB, POPE, MathV, MMVet. Close-source LMMs: Gemini Pro, Unk ...]

generation-based evaluation with greedy decoding. Generation-based evaluation here means letting the model generate free text and parsing the result from the generated text. The comparison in Table 5 shows that DeepSeek-VL-7B surpasses most open-source models of similar size across a wide range of benchmarks. DeepSeek-VL outperforms open-source models of similar size on benchmarks such as MMB, MMC, and SEEDbench.

Original: eration-based evaluation with greedy decoding. The generation-based evaluation here refers to letting the model generate free texts and parsing results from generated texts. The comparative results, as illustrated in Table 5, show that DeepSeek-VL-7B surpasses most open-source models of similar size across a wide range of benchmarks. DeepSeek-VL outperforms open-source models of similar size in benchmarks such as MMB, MMC, and SEEDbench, even approaching proprietary models (DeepSeek-VL vs. GPT-4V = 70.4 vs. 71.6 on seedbench), demonstrating its powerful natural image comprehension capability....

GSM8K (Cobbe et al., 2021). Code datasets including MBPP (Austin et al., 2021). Standardized exams including AGIEval (Zhong et al., 2023). We apply perplexity-based evaluation to datasets that require choosing an answer from several options. These datasets include HellaSwag and MMLU.

Original: g GSM8K (Cobbe et al., 2021). Code datasets including MBPP (Austin et al., 2021). Standardized exams including AGIEval (Zhong et al., 2023). We apply perplexity-based evaluation to datasets that require answers to be chosen from several options. These datasets include HellaSwag and MMLU. The perplexity-based evaluation here refers to calculating the perplexity of each option and selecting the lowest one as the model prediction. Perplexity-based evaluation helps to distinguish subtle probability difference between model predictions and avoids discontinuity of exact match style evaluation....
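The perplexity-based selection described above can be illustrated with a short sketch. It assumes the per-token log-probabilities of each candidate answer have already been obtained from the model; the function names and the numbers in `scores` are hypothetical and serve only as an example.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_option(option_logprobs):
    """Return the option label whose answer tokens have the lowest perplexity."""
    return min(option_logprobs, key=lambda label: perplexity(option_logprobs[label]))


# Hypothetical log-probabilities of the answer tokens for each option.
scores = {
    "A": [-2.0, -1.5, -3.0],
    "B": [-0.5, -0.7],
    "C": [-1.0, -1.2, -0.9, -1.1],
}
prediction = pick_option(scores)
print(prediction)  # B
```

Averaging over tokens before exponentiating keeps long and short options comparable, and the continuous score avoids the all-or-nothing behavior of exact-match parsing.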

dataset for manual evaluation. The dataset comprises 100 questions in seven categories, each covering specific tasks. The categories and tasks follow the same taxonomy as our in-house SFT data, shown in Table 3. This ensures that the tasks we test are universal and cover the majority of use cases for multimodal models.

Original: taset for manual evaluation. This dataset comprises 100 questions, divided into seven categories, each encompassing specific tasks. These categories and tasks are the same as our taxonomy for the in-house SFT data, as shown in Table 3. This approach ensures that the tasks we test are universal and encompass the majority of use cases for multimodal models. Moreover, based on the categories and tasks described in existing reports, we collect similar image materials and develop prompts. The sources for these image materials include royalty-free image communities and photographs taken by the resear...

models and ask GPT-4V to determine which one is better or to declare a tie. The results show that GPT-4V tends to rate the quality of DeepSeek-VL’s answers more favorably in the majority of cases. As illustrated in Figure 7, DeepSeek-VL is judged superior in over 60% of instances when compared with open-source multimodal models, including Fuyu-8B, CogVLM-17B, and InternLM-XComposer2-VL.

Original: models and ask GPT-4V to determine which one is better or declare a tie. The results indicate a preference for DeepSeek-VL’s responses in the majority of cases, as GPT-4V tends to rate the quality of DeepSeek-VL’s answers more favorably. As illustrated in Figure 7, DeepSeek-VL is judged to be superior in over 60% of instances when compared to open-source multimodal models, including Fuyu-8B, CogVLM-17B, and InternLM-XComposer2-VL. Moreover, in comparison with other proprietary models, such as GPT-4V itself, DeepSeek-VL demonstrates comparably exceptional performance. 4.4 Ablation Study Scale ...

and stage 3 still slightly lags behind the combined performance of stage 1, stage 2, and stage 3, indicating that the vision-language adaptor warmup stage remains meaningful. Modality Group Training: when mixing language and multimodal data, we observe that directly blending them at the batch level significantly reduces training efficiency.

Original: and stage 3 still slightly lags behind the combined performance of stage 1, stage 2, and stage 3, indicating that vision-language adaptor warmup stage remains meaningful. Modality Group Training When mixing language and multimodal data, we observe that directly blending them at the batch level significantly reduces training efficiency. This inefficiency arises because each batch gradient backpropagation process waits for the slowest sample to complete. As a result, the predominantly faster-to-process pure language data ends up waiting for the multimodal samples to finish, leading to a decrease...

ratio to 1, and then gradually decrease it to the target ratio for the final model training (e.g., 0.7). Figure 9: Comparative performance results on language (Pile-test) and multimodal (MMBench and MMBench_CN) benchmarks for modality warmup. Modality warmup consistently matches or surpasses the performance of approaches without modality warmup across all evaluated tasks in training stage 2.

Original: a ratio to 1, and then gradually decrease it to the target ratio for the final model training (e.g., 0.7). Figure 9: Comparative performance results on language (Pile-test) and multimodal (MMBench and MMBench_CN) benchmarks for modality warmup. Modality warmup consistently matches or surpasses the performance of approaches without modality warmup across all evaluated tasks on training stage 2 (Multimodal:Language=60%:40%). Our experiments, as illustrated in Figure 9, demonstrate that this strategy effectively prevents a significant decline in language capabilities at the beginning of training...
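The warmup schedule above (start at a ratio of 1, then decrease to a target such as 0.7) can be sketched as a simple function of the training step. The excerpt only says the ratio decreases "gradually", so the linear decay, the `warmup_steps` parameter, and the reading of the ratio as the share of language data are assumptions of this example.

```python
def language_ratio(step, warmup_steps, target=0.7):
    """Linearly decay the data ratio from 1.0 to `target` over `warmup_steps` steps."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return target
    progress = step / warmup_steps
    return 1.0 + (target - 1.0) * progress


# Ratio at steps 0..5 with a 4-step warmup (toy numbers).
schedule = [round(language_ratio(s, warmup_steps=4), 3) for s in range(6)]
print(schedule)
```

At each step the returned value would be used as the probability of drawing a language-only batch, so the model sees mostly familiar text early on and reaches the target mixture once the warmup ends.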

performance, although this comes at the cost of a longer sequence of visual feature tokens and therefore higher computational requirements. As shown in the top section of Table 10, reducing the sequence length by stacking visual features along the image’s width or height dimension before sequence concatenation, so as to keep the sequence length constant, does not achieve better results on most metrics than simply merging them along the embedding dimension.

Original: el performance, although this comes with the trade-off of increased computational requirements due to a longer sequence of visual feature tokens. As demonstrated in the top section of Table 10, reducing the sequence length by stacking visual features along the image’s width or height dimensions before sequence concatenation, in order to keep the sequence length constant, does not achieve better results compared to simply merging them along the embedding dimension in most metrics. In terms of the adaptor architecture, employing separate MLP adaptors for each vision feature encoder allows for m...
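The two fusion layouts compared above can be made concrete with toy feature maps. The sketch below uses nested Python lists shaped (H, W, C); the function names and the exact pooling rule (pairing horizontally adjacent tokens) are assumptions chosen to illustrate the shapes, not the paper's implementation.

```python
def flatten_tokens(fmap):
    """(H, W, C) nested lists -> list of H*W token vectors."""
    return [tok for row in fmap for tok in row]


def embedding_concat(a, b):
    """Merge two encoders along the embedding axis: H*W tokens of size 2C."""
    return [ta + tb for ta, tb in zip(flatten_tokens(a), flatten_tokens(b))]


def pool_width(fmap):
    """Stack horizontally adjacent tokens: H*W/2 tokens of size 2C (W must be even)."""
    return [row[j] + row[j + 1] for row in fmap for j in range(0, len(row), 2)]


def sequence_concat_pool_w(a, b):
    """Pool each encoder along the width, then concatenate along the sequence axis."""
    return pool_width(a) + pool_width(b)


# Two toy 2x2 feature maps with one channel each.
a = [[[1], [2]], [[3], [4]]]
b = [[[5], [6]], [[7], [8]]]
by_embedding = embedding_concat(a, b)
by_sequence = sequence_concat_pool_w(a, b)
print(by_embedding)
print(by_sequence)
```

Both layouts yield the same number of tokens with the same width, so the cost seen by the language model is identical; what differs is whether a token mixes two encoders at one location or two locations from one encoder.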

4.1 Public Multimodal Benchmarks Evaluation

We evaluate our models on a series of public benchmarks. Multimodal comprehensive understanding datasets: MMMU (Yue et al., 2023), CMMMU (Zhang et al., 2024), MMBench (Liu et al., 2023a), MMBench-CN (Liu et al., 2023a), SeedBench (Li et al., 2023a), and MMV (Yu et al., 2023b).

Original: We evaluate our models on a series of public benchmarks: Multimodal comprehensive understanding datasets: MMMU (Yue et al., 2023), CMMMU (Zhang et al., 2024), MMBench (Liu et al., 2023a), MMBench-CN (Liu et al., 2023a), SeedBench (Li et al., 2023a) and MMV (Yu et al., 2023b). We compare DeepSeek-VL with competitors on MMB/MMC-dev as current official test download link is no longer active. Chart/table understanding datasets: OCRBench (Liu et al., 2023b);
[Table columns: LLM, MMMU, CMMMU, MMB, MMC, SEED, OCRB, POPE, MathV, MMVet]
Close-source LMMs:
Gemini Pro: Unk, 48.9, -, 75.2, 74.0, 70.7, 659, -, 45.2, 59.2
GPT-4V...

generation-based evaluation with greedy decoding. Generation-based evaluation here means letting the model generate free text and parsing the result from the generated text. The comparison in Table 5 shows that DeepSeek-VL-7B surpasses most open-source models of similar size across a wide range of benchmarks. DeepSeek-VL outperforms open-source models of similar size on benchmarks such as MMB, MMC, and SEEDbench.

Original: g. The generation-based evaluation here refers to letting the model generate free texts and parsing results from generated texts. The comparative results, as illustrated in Table 5, show that DeepSeek-VL-7B surpasses most open-source models of similar size across a wide range of benchmarks. DeepSeek-VL outperforms open-source models of similar size in benchmarks such as MMB, MMC, and SEEDbench, even approaching proprietary models (DeepSeek-VL vs. GPT-4V = 70.4 vs. 71.6 on seedbench), demonstrating its powerful natural image comprehension capability. The model also surpasses all open-source mo...
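Generation-based evaluation, as described above, amounts to generating free text and then parsing an answer out of it. The following is a minimal sketch of the parsing and scoring half only; the regular expression, the function names, and the sample outputs are illustrative assumptions, and real benchmark harnesses typically use more careful answer extraction.

```python
import re


def parse_choice(generated, labels="ABCD"):
    """Return the first standalone option letter found in free text, else None."""
    match = re.search(rf"\b([{labels}])\b", generated)
    return match.group(1) if match else None


def accuracy(generations, answers):
    """Fraction of generations whose parsed choice equals the gold answer."""
    hits = sum(parse_choice(g) == a for g, a in zip(generations, answers))
    return hits / len(answers)


# Hypothetical greedy-decoded outputs and gold labels.
outputs = ["The answer is B.", "(C) a red car", "I am not sure."]
gold = ["B", "C", "A"]
score = accuracy(outputs, gold)
print(score)
```

A parser this simple misreads a sentence that begins with the article "A", which is one reason exact-match style scoring can be brittle compared with the perplexity-based scoring used for the language benchmarks.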

4.2 Public Language Benchmarks Evaluation

We evaluate our models on the following public language benchmarks. Multi-subject multiple-choice datasets including MMLU (Hendrycks et al., 2020). Language understanding and reasoning datasets including HellaSwag (Zellers et al., 2019). Language modeling datasets including Pile (Gao et al., 2020).

Original: We evaluate our models on the following public language benchmarks: Multi-subject multiple-choice datasets including MMLU (Hendrycks et al., 2020). Language understanding and reasoning datasets including HellaSwag (Zellers et al., 2019). Language modeling datasets including Pile (Gao et al., 2020).
Table 7: The performance on language benchmarks.
Version: DeepSeek-VL 1B Chat, DeepSeek-VL 7B Chat, DeepSeek-LLM 7B Chat
Encoder: SigLIP, SigLIP+SAM, None
HellaSwag: 56.0, 68.4, 68.5
MMLU: 32.5, 52.4, 49.4
GSM8K: 18.0, 55.0, 63.0
MBPP: 10.0, 35.2, 35.2
AGIEval: 14.0, 27.8, 19.3
Math datasets including GS...

Nevertheless, DeepSeek-VL-7B shows a certain degree of decline in mathematics (GSM8K), which suggests that, despite efforts to promote harmony between the vision and language modalities, a competitive relationship between them remains. This may be attributable to the limited model capacity (7B); larger models might alleviate the issue significantly.

Original: theless, DeepSeek-VL-7B shows a certain degree of decline in mathematics (GSM8K), which suggests that despite efforts to promote harmony between vision and language modalities, there still exists a competitive relationship between them. This could be attributed to the limited model capacity (7B), and larger models might alleviate this issue significantly. Overall, DeepSeek-VL strives to achieve the goal of minimizing declines in language capability while addressing these challenges.

4.3 Human Evaluation

To further explore the capabilities of DeepSeek-VL, we independently construct a dataset for manual evaluation. The dataset comprises 100 questions in seven categories, each covering specific tasks. The categories and tasks follow the same taxonomy as our in-house SFT data, shown in Table 3.

Original: To further explore the capabilities of our DeepSeek-VL, we independently construct a dataset for manual evaluation. This dataset comprises 100 questions, divided into seven categories, each encompassing specific tasks. These categories and tasks are the same as our taxonomy for the in-house SFT data, as shown in Table 3. This approach ensures that the tasks we test are universal and encompass the majority of use cases for multimodal models. Moreover, based on the categories and tasks described in existing reports, we collect similar image materials and develop prompts. The sources for these ima...

(Zheng et al., 2024), we show GPT-4V the question and the answers from two different models and ask it to determine which one is better or to declare a tie. The results show that GPT-4V tends to rate the quality of DeepSeek-VL’s answers more favorably in the majority of cases. As illustrated in Figure 7, DeepSeek-VL is judged superior in over 60% of instances when compared with open-source multimodal models, including Fuyu-8B, CogVLM-17B, and InternLM-XComposer2-VL.

Original: (Zheng et al., 2024), we show GPT-4V the question and the answers from two different models and ask GPT-4V to determine which one is better or declare a tie. The results indicate a preference for DeepSeek-VL’s responses in the majority of cases, as GPT-4V tends to rate the quality of DeepSeek-VL’s answers more favorably. As illustrated in Figure 7, DeepSeek-VL is judged to be superior in over 60% of instances when compared to open-source multimodal models, including Fuyu-8B, CogVLM-17B, and InternLM-XComposer2-VL. Moreover, in comparison with other proprietary models, such as GPT-4V itself,...
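The pairwise protocol above produces one verdict per question (better, worse, or tie), which is then summarized as a win rate. A minimal tallying sketch is shown below; the verdict labels and the sample list are made up for illustration, and the call to the judge model itself is deliberately left out.

```python
from collections import Counter


def win_rates(verdicts):
    """Turn a list of 'win' / 'tie' / 'loss' verdicts into fractions of the total."""
    counts = Counter(verdicts)
    total = len(verdicts)
    return {key: counts.get(key, 0) / total for key in ("win", "tie", "loss")}


# Hypothetical judge verdicts for five questions, from one model's point of view.
verdicts = ["win", "win", "tie", "loss", "win"]
rates = win_rates(verdicts)
print(rates)
```

A claim such as "judged superior in over 60% of instances" corresponds to the `win` fraction exceeding 0.6 over the full question set.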

4.4 Ablation Study

Scale Up Projector Training: we expand the dataset for stage 1 (projector warmup) and subsequently apply supervised fine-tuning. The results, shown in Figure 8, indicate that increasing the training data volume does not improve performance at this stage. This implies that the projector’s capacity is inherently limited, so it cannot capture the extensive knowledge necessary for multimodal tasks.

Original: Scale Up Projector Training We expand the dataset for stage 1 (projector warmup) and subsequently apply supervised fine-tuning. The results, depicted in Figure 8, demonstrate that augmenting the training data volume does not enhance performance at this stage. This implies that the projector’s capacity is inherently constrained, rendering it incapable of capturing the extensive knowledge necessary for multimodal tasks.
[Table columns: Stage 1 Training Step, MMB, MMC, SEED, POPE, MMMU, Average]
2K: 59.0, 54.0, 61.8, 82.3, 30.3, 57.5
8K: 58.0, 45.0, 58.5, 84.9, 29.2, 55.1
20K: 56.0, 52.3, 59.0, 81.7, 28.6, 55.5
80K: 58.1, 55.0, 58.6, 78.6...

decrease in overall training efficiency. Figure 8: Comparative analysis of modality warmup on language (Pile-test) and multimodal (MMBench and MMBench_CN) benchmarks shows that modality grouping consistently surpasses the non-grouped approach on language tasks while preserving performance on multimodal tasks in training stage 2.

Original: ecrease in overall training efficiency. Figure 8: Comparative analysis of modality warmup on language (Pile-test) and multimodal (MMBench and MMBench_CN) benchmarks demonstrates that modality grouping consistently surpasses the non-grouped modality approach in language tasks, while simultaneously preserving performance on multimodal tasks on training stage 2 (Multimodal:Language=60%:40%). To address this issue, we experiment with grouping different modalities of data at each global step, sampling distinct modalities separately. This approach involves organizing the training data so that...
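One way to read "grouping different modalities of data at each global step" is that every batch is drawn from a single modality, so fast text samples never wait on slow image samples within the same batch. The sketch below illustrates that reading only; the function name, the per-step random choice, and the 0.6 multimodal share (taken from the 60%:40% setting quoted above) are assumptions, and the excerpt is truncated before the actual procedure is fully described.

```python
import random


def grouped_batches(language, multimodal, steps, batch_size, mm_ratio=0.6, seed=0):
    """Yield (modality, batch) pairs in which each batch holds a single modality."""
    rng = random.Random(seed)
    for _ in range(steps):
        # Pick the modality for this whole step, then sample only from that pool.
        name = "multimodal" if rng.random() < mm_ratio else "language"
        pool = multimodal if name == "multimodal" else language
        yield name, rng.sample(pool, batch_size)


language = [f"text-{i}" for i in range(10)]
multimodal = [f"image-text-{i}" for i in range(10)]
batches = list(grouped_batches(language, multimodal, steps=5, batch_size=4))
for name, batch in batches:
    print(name, batch)
```

Over many steps the share of multimodal batches approaches `mm_ratio`, so the overall data mixture is preserved while each individual step stays homogeneous.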

training, while also yielding comparatively superior outcomes in the final phases for both the language and multimodal domains. This gradual adaptation lets the model adjust more smoothly to the incorporation of multimodal data, improving overall training stability and performance. Vision Encoder Selection: to better acquire and utilize image information, we compare the training loss of different vision encoders.

Original: aining, while also yielding comparatively superior outcomes in the final phases for both the language and multimodal domains. This gradual adaptation enables the model to more seamlessly adjust to the incorporation of multimodal data, thereby improving overall training stability and performance. Vision Encoder Selection In order to better acquire and utilize image information, we compare the training loss of different vision encoders under our training settings except for reducing training steps of stage 2 to 8000 for efficiency. As illustrated in Figure 10, the incorporation of vision-only s...

for more precise adjustments to the specific values and distribution patterns of visual features, facilitating smoother model training. Conversely, using a shared MLP adaptor for different vision encoders contributes to adequate feature fusion. We adopt a mixed strategy and report stable and improved performance, as outlined in the lower section of Table 10.

Original: for more precise adjustments to the specific values and distribution patterns of visual features, facilitating smoother model training. Conversely, using a shared MLP adaptor for different vision encoders contributes to adequate feature fusion. We adopt a mixed strategy and report stable and improved performance, as outlined in the lower section of Table 10.
[Table columns: Architecture, MMB, MMC, SEED, POPE, ScienceQA, MMMU, OCRB, Average]
Sequence Concatenation:
Token Pooling - W: 61.2, 59.6, 61.6, 86.5, 57.7, 31.6, 304, 55.5
Token Pooling - H: 59.9, 58.3, 61.6, 83.8, 55.0, 32.0, 291, 54.2
Embedding Concatenation:
Hybrid MLP: 61.7 ...

5 Conclusion, Limitation, and Future Work

In this technical report, we have introduced DeepSeek-VL, a series of multimodal large language models available at 1.3B and 6.7B parameter scales. The report reveals the limitations inherent in the predominant projector-based pretraining methodologies, setting the stage for the approach adopted by DeepSeek-VL.

Original: In this technical report, we have introduced DeepSeek-VL, a series of Multimodal Large Language Models, available in scales of 1.3B and 6.7B parameters. This report has unveiled the limitations inherent in the predominant projector-based pretraining methodologies, setting the stage for the innovative approach adopted by DeepSeek-VL. By prioritizing a joint vision and language (VL) pretraining phase, DeepSeek-VL transcends traditional models by ensuring that the integration of multimodal data does not compromise the linguistic capabilities of the Large Language Models (LLMs). This is achieved t...

Appendix A Appendix

Figure 11: Visualization results. DeepSeek-VL can understand children’s programming diagrams from the real world and provide detailed and organized explanations. Figure 12: Visualization results. DeepSeek-VL has strong understanding capabilities for code and charts in the real world. Figure 13: Visualization results. DeepSeek-VL possesses extensive knowledge of the real world. Figure 14: Visualization results. DeepSeek-VL is capable of accurately reading the contents of real-world tables.

Original: Figure 11: Visualization results. DeepSeek-VL can understand children’s programming diagrams from the real world and provide detailed and organized explanations. Figure 12: Visualization results. DeepSeek-VL has strong understanding capabilities for code and charts in the real world. Figure 13: Visualization results. DeepSeek-VL possesses extensive knowledge of the real world. Figure 14: Visualization results. DeepSeek-VL is capable of accurately reading the contents of real-world tables.
