We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two
major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding
strategy designed for processing high-resolution images with different aspect ratios. For the
language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention
mechanism, which compresses Key-Value cache into latent vectors, to enable efficient inference
and high throughput. Trained on an improved vision-l...
Large Vision-Language Models (VLMs) have emerged as a transformative force in artificial
intelligence [15, 54, 59, 63, 83, 88, 94], extending the remarkable capabilities of Large Language
Models (LLMs) to seamlessly process both visual and textual information. This advancement has
dramatically expanded the potential for AI systems to tackle complex real-world applications
that require multimodal understanding. In this technical report, we present DeepSeek-VL2, a new series of open-source Vision-
Language Models that leverages the Mixture-of-Experts (MoE) architecture to achieve substantial
imp...
We further enhance efficiency through the DeepSeekMoE framework [20, 86],
which employs sparse computation techniques. Our model series adopts three
MoE variants, 3B, 16B, and 27B. These LLMs have 0.57B, 2.4B, and 4.1B activated parameters
respectively. We also greatly enhance our vision-language training data in terms of quality, quantity, and
diversity. This comprehensive dataset enables better generalization and performance across
a broad spectrum of tasks, including Visual Question Answering (VQA), Optical Character
Recognition (OCR), document/table/chart understanding, visual reasoning, a...
DeepSeek-VL2 consists of three core modules: (1) a vision encoder, (2) a vision-language
adaptor, and (3) a Mixture-of-Experts language model. Building upon the decoder-only LLaVA-
style [54] architecture of its predecessor, DeepSeek-VL2 introduces two major advancements:
a dynamic tiling strategy and a DeepSeekMoE [20, 86] language model featuring Multi-head
Latent Attention [53]. These innovations enable more efficient processing of both high-resolution
visual inputs and text data. Dynamic Tiling Strategy. The original DeepSeek-VL employed a hybrid vision encoder
combining SigLIP [106] for c...
For computational efficiency and
context length management, we disable the dynamic tiling strategy when processing multiple
(> 2) images.
Footnote 1: We first resize the original image until its long side matches the target resolution, then pad the other dimension
while maintaining the original aspect ratio.
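The resize-then-pad step in the footnote, together with the rule that tiling is switched off for more than two images, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `Letterbox` type, and the rounding behaviour are assumptions, and the target resolution is a free parameter.

```python
from typing import NamedTuple


class Letterbox(NamedTuple):
    resized_w: int  # image width after aspect-preserving resize
    resized_h: int  # image height after aspect-preserving resize
    pad_w: int      # total horizontal padding needed to reach the target
    pad_h: int      # total vertical padding needed to reach the target


def resize_then_pad(width: int, height: int, target_w: int, target_h: int) -> Letterbox:
    """Resize until one side matches the target, then pad the other side."""
    if min(width, height, target_w, target_h) <= 0:
        raise ValueError("all dimensions must be positive")
    # Largest uniform scale at which the image still fits inside the target,
    # so the aspect ratio is preserved.
    scale = min(target_w / width, target_h / height)
    resized_w = min(target_w, max(1, round(width * scale)))
    resized_h = min(target_h, max(1, round(height * scale)))
    return Letterbox(resized_w, resized_h, target_w - resized_w, target_h - resized_h)


def use_dynamic_tiling(num_images: int) -> bool:
    """Tiling is disabled when more than two images are processed."""
    return num_images <= 2
```

For example, a hypothetical 800×400 image with a 384×384 target is resized to 384×192 and needs 192 rows of padding.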
[Figure 3 diagram. Labels: Input image, Dynamic Tiling, Local Tiles, Encode & Merge, Flatten; "\n" marks the New Line Token and "sep" marks the View Separator Token.]
Figure 3 | Illustration of dynamic tiling strategy in DeepSeek-VL2. By dividing images
into multiple tiles, DeepSeek-VL2 achieves stronger...
The model also incorporates an MoE architecture [20], allowing for efficient inference through
sparse computation. During MoE training, we introduce a global bias term [86] for each expert to
cost-effectively improve load balancing between experts. DeepSeek-VL2 comes in three variants
with the following numbers of activated parameters: 1.0B, 2.8B and 4.5B. Complete architectural specifications can be
found in Table 1.
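To make the role of a per-expert bias more concrete, the sketch below shows one common way such a term can be used. It rests on assumptions beyond the text: the bias is added to the routing scores only when selecting the top-k experts, and after each batch it is nudged by a fixed step towards under-loaded experts. The function names and the step size are hypothetical.

```python
from typing import List


def route_top_k(scores: List[float], bias: List[float], k: int) -> List[int]:
    """Select k experts for one token using bias-adjusted routing scores.

    The bias only influences which experts are chosen; the gating weights
    would still be computed from the unbiased scores (not shown here).
    """
    biased = [s + b for s, b in zip(scores, bias)]
    ranked = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)
    return ranked[:k]


def update_bias(bias: List[float], loads: List[int], step: float) -> List[float]:
    """Lower the bias of over-loaded experts and raise it for under-loaded ones."""
    mean_load = sum(loads) / len(loads)
    updated = []
    for b, load in zip(bias, loads):
        if load > mean_load:
            updated.append(b - step)
        elif load < mean_load:
            updated.append(b + step)
        else:
            updated.append(b)
    return updated
```

Because the adjustment acts on expert selection rather than on the training loss, it adds very little overhead.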
We build a comprehensive Vision-Language dataset from diverse sources for DeepSeek-VL2.
The training process is structured into three distinct stages: (1) VL alignment, (2) VL pretraining,
and (3) supervised fine-tuning (SFT). In the following parts, we provide descriptions of the data
used in each stage.
The alignment stage focuses on training the MLP connector to bridge the pretrained visual
encoder and the LLM. For this initial warmup phase, we utilize ShareGPT4V [12], a dataset
containing approximately 1.2M caption and conversation samples.
Following DeepSeek-VL [59], our pretraining data combines vision-language (VL) and text-only
data to maintain a balance between VL capabilities and text-only performance. For DeepSeek-
VL2, we maintain a ratio of around 70% VL data to 30% text-only data, with the latter sourced
directly from our base LLM pretraining corpus. In the following, we categorize the VL data into
several groups and describe their details. Interleaved image-text data. Our data collection begins with several open-sourced datasets,
including WIT [79], WikiHow [38], and 30% random samples from OBELICS [41]. This specific
...
Currently, our in-house dataset mainly focuses on English and Chinese character recognition. We plan to expand to
other languages in our future work.
Visual question-answering (QA) data. In our early exploration, we found general QA data
clearly benefits model pretraining. Consequently, we developed a comprehensive visual QA
dataset consisting of the following categories:
• General VQA. We inherit the general VQA data from DeepSeek-VL. For more details,
please refer to [59].
• Table, chart and document understanding. We adopt PubTabNet [112], FinTabNet [111]
and Docmatix [42] to enhance docu...
We derived our grounded conversation dataset from [71], structured in the following format:
• Prompt: <|grounding|>Can you describe the content of the image?
• Response: Two <|ref|>dogs<|/ref|><|det|>[[x1, y1, x2, y2],...]<|/det|> are running on the grass.
As in other visual grounding data, <|grounding|>, <|ref|>, <|/ref|>, <|det|>, <|/det|>
are special tokens and x1, y1, x2, y2 is subject to the same normalization scheme.
Our SFT data combines a diverse collection of open-sourced datasets with high-quality in-house
QA pairs. Below, we detail our efforts to enhance the quality of our SFT dataset. General visual question-answering. While public visual QA datasets are diverse [9, 10, 27,
31, 43, 47, 74], they often suffer from three main limitations: (1) short responses, (2) poor OCR
quality, and (3) hallucinated content. To address these issues, we regenerate responses by jointly
considering the original questions, images, and OCR information. Our experiments demonstrate
that this approach produces more comprehen...
We enhance public reasoning-focused datasets [17, 43, 61,
76, 102, 109] with more detailed reasoning processes and standardize response formats so that
the final answer is placed at the end of the response. We observe that detailed responses are less
effective when training smaller VLMs. In our exploration, DeepSeek-VL2-Tiny shows better
performance with more concise responses. Textbook and academic questions. We build an internal dataset focused on textbooks from our
document collection. This dataset primarily emphasizes college-level contents across multiple
academic disciplines. Web-to-code and plo...
DeepSeek-VL2 is trained through a three-stage pipeline: (1) an initial stage where we train the
vision encoder and vision-language adaptor MLP while keeping the language model fixed, using
image-text paired data detailed in Section 3.1, (2) a pretraining stage where we conduct vision-
language pre-training using the data described in Section 3.2, and (3) a fine-tuning stage where
we perform supervised fine-tuning with the data outlined in Section 3.3. In both the pretraining
and fine-tuning stages, all model parameters, including the vision encoder, vision-language
adaptor, and language model,...
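The three-stage schedule can be summarized as a small lookup of which modules receive gradient updates in each stage. The sketch below is illustrative only: the stage and module names are hypothetical labels, and the entries for the pretraining and fine-tuning stages assume that "all model parameters" means all three listed modules are trained.

```python
from typing import Dict, FrozenSet

# Hypothetical module labels for the three components named in the text.
MODULES: FrozenSet[str] = frozenset({"vision_encoder", "vl_adaptor", "language_model"})

# Modules that are trained in each stage.
TRAINABLE_BY_STAGE: Dict[str, FrozenSet[str]] = {
    # Stage 1: vision encoder and adaptor are trained, language model is fixed.
    "alignment": frozenset({"vision_encoder", "vl_adaptor"}),
    # Stages 2 and 3: assumed to train every module.
    "pretraining": MODULES,
    "sft": MODULES,
}


def frozen_modules(stage: str) -> FrozenSet[str]:
    """Return the modules whose parameters stay fixed in the given stage."""
    return MODULES - TRAINABLE_BY_STAGE[stage]
```

In a training script, a lookup like this would typically drive which parameter groups are handed to the optimizer.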
Detailed hyperparameters for DeepSeek-VL2 training are listed in Table 2. We conducted our
training and evaluation using HAI-LLM [30], an efficient and lightweight platform designed
for large models. A significant challenge in our pipeline parallel strategy arose from the
vision encoder’s unique computational characteristics compared to LLM blocks. As the first
component in the model pipeline, the vision encoder requires careful load balancing across GPUs
to prevent pipeline bubbles and optimize GPU utilization. To address this, we implemented
fine-grained layer division of the vision encoder ...
Benchmarks. We perform a holistic evaluation of DeepSeek-VL2 across a collection of commonly
used benchmarks, including DocVQA [66], ChartQA [65], InfoVQA 2 [67], TextVQA [77],
RealWorldQA [95], OCRBench [57], AI2D [34], MMMU [105], MMStar [13], MathVista [60],
MME [26], MMBench, MMBench-V1.1 [58] and MMT-Bench [100]. These benchmarks span
diverse tasks from document understanding and chart interpretation to real-world problem solv-
ing, enabling comprehensive evaluation of our model’s capabilities. To evaluate the grounding
capability of our models, we test DeepSeek-VL2 on the RefCOCO, RefCO...
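f|> in the given image.
• Response: <|ref|><|/ref|><|det|>[[x1, y1, x2, y2],…]<|/det|>
During training, the question prompts are randomly sampled from a candidate pool. <|ref|>, <|/ref|>, <|det|>, <|/det|> are special tokens. The placeholder between <|ref|> and <|/ref|> stands for either a category name (e.g., "car") or a description of the object (e.g., "the leftmost person"). [[x1, y1, x2, y2], …] is a list of bounding boxes, where each bounding box corresponds to the position of one object. The coordinates x1, y1 and x2, y2 specify the top-left and bottom-right corners respectively, normalized to values between 0 and 999 according to the image resolution. We also construct negative samples, in which the queried object is intentionally absent from the image, to enhance the robustness of the model.
Visual grounded conversation data. Our grounded conversation dataset is derived from [71] and structured in the following format:
• Prompt: <|grounding|>Can you describe the content of the image?
• Response: Two <|ref|>dogs<|/ref|><|det|>[[x1, y1, x2, y2],…]<|/det|> are running on the grass.
As in other visual grounding data, <|grounding|>, <|ref|>, <|/ref|>, <|det|>, <|/det|> are special tokens, and x1, y1, x2, y2 follow the same normalization scheme.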
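The grounding format can be exercised with a short sketch that normalizes pixel boxes to the 0 to 999 range, serializes a grounded phrase, and parses it back. The helper names are hypothetical, and the exact rounding used for normalization is an assumption, since the text only states the target range.

```python
import json
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]

# Matches one "<|ref|>phrase<|/ref|><|det|>[[...], ...]<|/det|>" span.
_GROUNDED = re.compile(r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>", re.S)


def normalize_box(x1: float, y1: float, x2: float, y2: float,
                  width: int, height: int) -> Box:
    """Map pixel corners to integers in [0, 999] (rounding is an assumption)."""
    def to_bin(value: float, size: int) -> int:
        return max(0, min(999, round(value / size * 999)))
    return (to_bin(x1, width), to_bin(y1, height), to_bin(x2, width), to_bin(y2, height))


def format_grounded(phrase: str, boxes: List[Box]) -> str:
    """Wrap a phrase and its boxes in the special tokens."""
    box_text = json.dumps([list(b) for b in boxes])
    return f"<|ref|>{phrase}<|/ref|><|det|>{box_text}<|/det|>"


def parse_grounded(text: str) -> List[Tuple[str, List[Box]]]:
    """Extract (phrase, boxes) pairs from a grounded response string."""
    results = []
    for phrase, raw_boxes in _GROUNDED.findall(text):
        boxes = [tuple(box) for box in json.loads(raw_boxes)]
        results.append((phrase, boxes))
    return results
```

A response such as "Two <|ref|>dogs<|/ref|><|det|>[[250, 250, 749, 749]]<|/det|> are running on the grass." parses into the phrase "dogs" with one box.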
On the
grounding benchmarks, we compare DeepSeek-VL2 with Grounding DINO [56], UNINEXT [52],
ONE-PEACE [87], mPLUG-2 [97], Florence-2 [96], InternVL2 [16], Shikra [11], TextHawk2 [103],
Ferret-v2 [108], MM1.5 [107] and Qwen2 [88]. Our models outperform the other VLMs at
similar scales.
In this section, we demonstrate different capabilities of DeepSeek-VL2, ranging from general
question answering to visual storytelling and visual grounding.
General visual question answering. Benefiting from our new VL pretraining dataset and
diverse SFT data, DeepSeek-VL2 demonstrates significantly improved ability on general visual
question answering, as shown in Figure 4. Overall, this model excels at dense image description
and it is able to recognize common landmarks, general visual knowledge, and rich text in both
English and Chinese. It also performs favorably on chart understanding wit...
This serene but powerful
seascape captures nature’s grandeur through its interplay between landforms, water bodies, light conditions, and vegetation
all harmoniously coexisting within the same visual space.
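Landmark Recognition
Prompt: Guess where this is?
Response: This image depicts a sunset scene by a lake, surrounded by mountains, with several boats moored on the water. Particularly eye-catching is a large boat in the middle, which has eaves in the style of traditional Chinese architecture. This style closely resembles ancient Chinese buildings, especially imperial or temple architecture.
This scene may be located at a famous lake in China, such as West Lake in Hangzhou or Xuanwu Lake in Nanjing. In particular, if this boat is an antique-style sightseeing boat, it is very likely located in the West Lake scenic area. West Lake is one of the most famous lakes in China, surrounded by beautiful mountain scenery and ancient buildings, and it is often a place where visitors go sightseeing and experience traditional culture.
Therefore, the image probably shows West Lake in China, a place with rich history, culture, and natural beauty.
Poet Writing
Prompt: Please write a poem for this scene.
Response:
"Gazing at Gong Mountain"
Drifting clouds leave the peaks in the light chill of dawn,
lush dark mountains unfold like a painted screen.
A few trees cast lone shadows across the green fields,
a pool of deep water embraces the golden city.
Wild geese cry from distant peaks as a thousand mountains rise,
cranes keep company with a quiet dwelling as all things flourish.
Gazing afar, the mist and rosy clouds make the mountains fair,
yet I seek the colors of spring beyond the riverside pavilion.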
The cake in ...
The x-axis represents the number of training
tokens in billions, while the y-axis shows the average normalized score, which is a measure of model performance.
### Key Points:
1. **Datasets Compared**:
- FineWeb 2 (Ours)
- CulturaX
- HPLT-2
- CC-100
- CommonCrawl
- mC4
- HPLT
2. **Trend Analysis**:
- FineWeb 2 consistently achieves the highest average normalized score, indicating superior performance compared to
other datasets across the given range of training tokens.
- The other datasets show varying levels of performance but tend to converge towards a similar score as the number of
training ...
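Prompt: What is being advertised in the image?
Serving souls since 2012
Text Recognition
Prompt: What is written on the large blackboard?
Response: The image shows food and drinks related to West Lake vinegar fish. The billboard on the left shows West Lake vinegar fish cake (West Lake Fish Cake), a food themed on West Lake vinegar fish. The billboard on the right advertises West Lake vinegar fish coffee, which is called "Iron Dog Coffee"; its distinguishing feature is that it combines the characteristics of West Lake vinegar fish with local Hangzhou experience, blending the flavor of West Lake vinegar fish into the coffee. West Lake vinegar fish is a famous Hangzhou dish, mainly made with grass carp and cooked with seasonings such as cooking wine, vinegar, and sugar, which gives it a sweet and sour taste. By adding West Lake vinegar fish to other foods and drinks, this traditional delicacy has been given a brand-new interpretation.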
The over-the-top behavior contrasts with the seemingly simple instruction, creating an
unexpected and funny outcome.
3. **Relatable scenario**: This image humorously captures a widely understood, everyday behavior—
children wanting things they are told they cannot have, leading to playful and sometimes rebellious
actions.
4. **Visual cue**: The child's action of taking a bite adds a dynamic and engaging element to the
humor, amplifying the sense of mischief and delight in the moment.
In short, this image humorously captures the playful defiance of childhood, using the visual of a child
taking a...
I sat in the Paris Airport terminal, the word
"BIENVENUE" etched in bold, inviting letters above me. The Eiffel Tower symbol floated serenely, a promise of what was to come. I
reached into my bag, feeling the cool metal of my suitcase handle, and bit my lip, trying not to sink further into the comfort of the
grey seats. The ceiling lights buzzed softly, casting a sterile glow that neither cheered nor soothed my nerves. I boarded the plane, the roar of engines a symphony of promise. As we soared above the clouds, the world below became a
patchwork of life—houses, roads, rivers—each a story waiti...
We aim to explore solutions to broaden the scope of storytelling while considering
these challenges.
Visual grounding. Visual grounding is a new ability we bring to DeepSeek-VL2. In Figure 8,
we show the general grounding ability of DeepSeek-VL2. Interestingly, although the majority
of images in our training set come from natural scenes, and the referring expressions are
object category names or specific descriptions of objects, we find that the model is capable of
generalizing to other scenarios (such as memes and animes), and has the ability to recognize
certain celebrities and abstract conce...
Our model can perform reasoning on
images while identifying the locations of relevant objects, thereby enabling the possibility of
interacting with the real world.
[Figure 10 panel: In-context Visual Grounding. Prompts shown: "In the first image, an object within the red rectangle is marked. Locate the object of the same category in the second image."; "Find the most frequently appearing fruit from the first image in the second image."; "According to the first image, which dish contains that ingredient in the second image?"]
Figure 10 | In-context visual grounding with DeepSeek-VL2. Given one imag...
In this technical report, we introduce DeepSeek-VL2, an enhanced version of MoE-based Vision-
Language Models, available in scales of 3B, 16B, and 27B parameters in total, with corresponding
activated parameters of 1.0B, 2.8B, and 4.5B. This configuration facilitates efficient computational
consumption during both training and inference stages. Notably, our 3B, 16B and 27B models
can be deployed on a single GPU with 10GB, 40GB and 80GB memory respectively. We employ
a dynamic tiling vision encoding strategy to efficiently process high-resolution images with
various aspect ratios. By making co...
Gqa: A new dataset for real-world visual reasoning
and compositional question answering. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 6700–6709, 2019.
[32] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda,
A. Hayes, A. Radford, et al. Gpt-4v(ision) system card. 2023.
[33] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. Referitgame: Referring to objects
in photographs of natural scenes. In Proceedings of the 2014 conference on empirical
methods in natural language processing (EMNLP), pages 787–798, 20...
Multimodal ArXiv: A dataset
for improving scientific comprehension of large vision-language models. In ACL, 2024.
[49] L. Li, Y. Wang, R. Xu, P. Wang, X. Feng, L. Kong, and Q. Liu. Multimodal arxiv: A dataset
for improving scientific comprehension of large vision-language models. arXiv preprint
arXiv:2403.00231, 2024.
[50] X. Li, F. Zhang, H. Diao, Y. Wang, X. Wang, and L.-Y. Duan. Densefusion-1m: Merging
vision experts for comprehensive multimodal perception. arXiv preprint arXiv:2407.08303,
2024.
[51] Z. Li, X. Yang, K. Choi, W. Zhu, R. Hsieh, H. Kim, J. H. Lim, S. Ji, B. Lee, X. Yan, et al. ...
Mathvista: Evaluating mathematical reasoning of foundation models in visual
contexts. In The Twelfth International Conference on Learning Representations.
[61] P. Lu, L. Qiu, J. Chen, T. Xia, Y. Zhao, W. Zhang, Z. Yu, X. Liang, and S.-C. Zhu. Iconqa:
A new benchmark for abstract diagram understanding and visual language reasoning.
arXiv preprint arXiv:2110.13214, 2021.
[62] C. Ma, Y. Jiang, J. Wu, Z. Yuan, and X. Qi. Groma: Localized visual tokenization for
grounding multimodal large language models. In European Conference on Computer
Vision, pages 417–435. Springer, 2025.
[63] Y. Ma, X. Liu, X...
Large-scale classification of fine-art paintings: Learning the
right metric on the right feature. arXiv preprint arXiv:1505.00855, 2015.
[74] S. Shah, A. Mishra, N. Yadati, and P. P. Talukdar. Kvqa: Knowledge-aware visual question
answering. In Proceedings of the AAAI conference on artificial intelligence, volume 33,
pages 8876–8884, 2019.
[75] S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and J. Sun. Objects365: A
large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF
international conference on computer vision, pages 8430–8439, 2019.
[76] W. Sh...
Openmathinstruct-2: Accelerating ai for math with massive open-source instruction
data. arXiv preprint arXiv:2410.01560, 2024.
[85] J. Wang, P. Zhang, T. Chu, Y. Cao, Y. Zhou, T. Wu, B. Wang, C. He, and D. Lin. V3det:
Vast vocabulary visual detection dataset. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 19844–19854, 2023.
[86] L. Wang, H. Gao, C. Zhao, X. Sun, and D. Dai. Auxiliary-loss-free load balancing strategy
for mixture-of-experts. CoRR, abs/2408.15664, 2024. doi: 10.48550/ARXIV.2408.15664. URL https://doi.org/10.48550/arXiv.2408.15664.
[87] P. Wang, ...
In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4818–4829,
2024.
[97] H. Xu, Q. Ye, M. Yan, Y. Shi, J. Ye, Y. Xu, C. Li, B. Bi, Q. Qian, W. Wang, et al. mplug-2: A
modularized multi-modal foundation model across text, image and video. In International
Conference on Machine Learning, pages 38728–38748. PMLR, 2023.
[98] Z. Xu, F. Jiang, L. Niu, Y. Deng, R. Poovendran, Y. Choi, and B. Y. Lin. Magpie: Alignment
data synthesis from scratch by prompting aligned llms with nothing. arXiv preprint
arXiv:2406.08464, 2024.
[99] Y. Yao, T. Yu, A. Zhang, C. Wang, ...
Mm1.5: Methods, analysis & insights from multimodal llm fine-tuning. arXiv
preprint arXiv:2409.20566, 2024.
[108] H. Zhang, H. You, P. Dufter, B. Zhang, C. Chen, H.-Y. Chen, T.-J. Fu, W. Y. Wang, S.-F. Chang, Z. Gan, et al. Ferret-v2: An improved baseline for referring and grounding with
large language models. arXiv preprint arXiv:2404.07973, 2024.
[109] R. Zhang, X. Wei, D. Jiang, Y. Zhang, Z. Guo, C. Tong, J. Liu, A. Zhou, B. Wei, S. Zhang,
P. Gao, and H. Li. Mavis: Mathematical visual instruction tuning, 2024. URL
https://arxiv.org/abs/2407.08739.
[110] B. Zheng, B. Gou, J. Kil, H. Sun, and...