Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

arXiv: 2410.13848 | 2024-10-19

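Summary

Janus is a unified model for visual understanding and generation. By decoupling visual encoding, it handles both directions between images and text: interpreting images and generating images from text. It removes the barrier between understanding and generation by serving both tasks within a single unified architecture. Using separate visual encoders for the two tasks, it keeps its understanding ability while gaining strong generation capability.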

Abstract: This paper presents the architecture, training method, and experimental results of Janus.

Original: Chengyue Wu (1,2), Xiaokang Chen (1,∗,†), Zhiyu Wu (1,3), Yiyang Ma (1,3), Xingchao Liu (1), Zizheng Pan (1), Wen Liu (1), Zhenda Xie (1), Xingkai Yu (1), Chong Ruan (1), Ping Luo (2,∗). Affiliations: 1 DeepSeek-AI; 2 The University of Hong Kong; 3 Peking University. †: Project lead. ∗: Corresponding authors. Project Page: https://github.com/deepseek-ai/Janus

Abstract. In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity require...

(See the original text below.) Through its architecture design and training method, the DeepSeek team reports notable progress in this area. The model performs well on the relevant benchmarks, which supports the effectiveness of the approach. The work is a contribution to the open-source AI community, and the authors expect to keep refining the technique.

Original: ...tion domains [51, 20]. In the field of multimodal understanding, researchers follow the design of LLaVA [51] by using a vision encoder as a bridge to enable large language models (LLMs) to understand images. In the field of visual generation, diffusion-based approaches [20, 9, 67, 20] have seen notable success. More recently, some works have explored autoregressive methods for vision generation [73, 79], achieving performance comparable to diffusion models. To build more powerful and generalist multimodal models, researchers have sought to combine multimodal understanding and ge...

Original: ...s representation tends to mainly focus on high-dimensional semantic representation. By contrast, in visual generation tasks, the main focus is on generating local details and maintaining global consistency in the image. The representation in this context necessitates a low-dimensional encoding that is capable of fine-grained spatial structure and textural detail expression. Unifying the representations of these two tasks within the same space will lead to conflicts and trade-offs. Consequently, existing unified models for multimodal understanding and generation often compromise on multimodal u...

Original: ...f-the-art encoding techniques specific to their domain. Moreover, it is possible for Janus to accommodate additional input types in the future, such as point clouds, EEG signals, or audio data, where independent encoders can extract features and then use a unified transformer to process them. To the best of our knowledge, we are the first to highlight the importance of decoupling visual encoding within the unified multimodal understanding and generation framework. Our experimental results show that Janus surpasses existing unified models with comparable parameter sizes on both multimodal under...

Original: ...y, masked prediction models [7, 8] draw upon BERT-style [19] masking methods, predicting masked sections of visual inputs to improve synthesis efficiency, and have been adapted for video generation [89]. Concurrently, continuous diffusion models have showcased impressive capabilities in visual generation [33, 71, 67], complementing discrete methods by approaching generation through a probabilistic lens. 2.2 Multimodal Understanding. Multimodal large language models (MLLMs) integrate both text and images [6, 80, 81]. By leveraging pretrained LLMs, MLLMs [55, 51, 95, 82, 12...

Original: ...and generation. In contrast, our Janus can explicitly decouple the visual representations for understanding and generation, recognizing that different tasks may require varying levels of information. 3 Janus: A Simple, Unified and Flexible Multimodal Framework. Figure 2: Architecture of our Janus. Different from previous approaches [77, 85] that typically assume visual understanding and generation require the same visual encoder, our Janus decouples visual encoding for visual understanding and visual generation. “Und. Encoder” and “Gen. Encoder” are abbreviations for “Understanding Encoder” ...

Original: ...nitialized prediction head is used for image predictions in the visual generation task. The entire model adheres to an autoregressive framework without the need for specially designed attention masks. 3.2 Training Procedure

1 Introduction

Introduction: the research background, motivation, and main contributions of Janus.

Original: (a) Benchmark Performance. (b) Visual Generation Results. Figure 1: Multimodal understanding and vision generation results from our Janus. Janus outperforms the previous state-of-the-art unified multimodal models as well as some task-specific multimodal understanding models, while also demonstrating strong visual generation capabilities. The image resolution is 384 × 384. Best viewed on screen.

In recent years, multimodal large models have made significant advancements in both understanding and generation domains [51, 20]. In the field of multimodal understanding, r...

Original: ...single vision encoder to process inputs for both tasks. However, the representations required by multimodal understanding and generation tasks differ significantly. In multimodal understanding tasks, the purpose of the vision encoder is to extract high-level semantic information (e.g., object categories or visual attributes within an image). The output of the understanding task not only involves extracting information from images but also involves complex semantic reasoning. Therefore, the granularity of the vision encoder’s representation tends to mainly focus on high-dimensional semantic rep...

Original: ...o independent visual encoding pathways: one for multimodal understanding and one for multimodal generation, unified by the same transformer architecture. The proposed method offers two main benefits: (1) Janus alleviates the conflict stemming from the different granular needs of multimodal understanding and generation and eliminates the need to make trade-offs between two tasks when selecting visual encoders. (2) Janus is flexible and extensible. After decoupling, both the understanding and generation tasks can adopt state-of-the-art encoding techniques specific to their domain. Moreover, it i...

Original: ...e for next-generation unified multimodal models.

2 Related Work

Original: 2.1 Visual Generation. Visual generation is a rapidly evolving field that combines concepts from natural language processing with advancements in transformer architectures. Autoregressive models, influenced by the success in language processing, leverage transformers to predict sequences of discrete visual tokens (codebook IDs) [24, 65, 75]. These models tokenize visual data and employ a prediction approach similar to GPT-style [64] techniques. Additionally, masked prediction models [7, 8] draw upon BERT-style [19] masking methods, predicting masked sections of visual inputs to impr...

Original: ...odels typically use a single visual representation for both understanding and generation tasks, regardless of whether they are based on autoregressive (AR) models [77, 85] or diffusion models [86, 94]. For example, Chameleon [77] adopts a VQ Tokenizer to encode images for both multimodal understanding and generation. However, this practice may lead to suboptimal outcomes, as the vision encoder might face a trade-off between the demands of understanding and generation. In contrast, our Janus can explicitly decouple the visual representations for understanding and generation, recognizin...

2.1 Visual Generation

Original: Visual generation is a rapidly evolving field that combines concepts from natural language processing with advancements in transformer architectures. Autoregressive models, influenced by the success in language processing, leverage transformers to predict sequences of discrete visual tokens (codebook IDs) [24, 65, 75]. These models tokenize visual data and employ a prediction approach similar to GPT-style [64] techniques. Additionally, masked prediction models [7, 8] draw upon BERT-style [19] masking methods, predicting masked sections of visual inputs to improve synthesis efficien...

2.2 Multimodal Understanding

Original: Multimodal large language models (MLLMs) integrate both text and images [6, 80, 81]. By leveraging pretrained LLMs, MLLMs [55, 51, 95, 82, 12, 2, 1] demonstrate a robust ability to understand and process multimodal information. Recent advancements have explored extending MLLMs with pretrained diffusion models to facilitate image generation [27, 36, 75, 76, 29]. These methods fall under the category of tool utilization, where diffusion models are used to generate images based on the conditions output by the MLLM, while the MLLM itself does not have the ability to directly pe...

2.3 Unified Multimodal Understanding and Generation

Original: Unified multimodal understanding and generation models are considered powerful for facilitating seamless reasoning and generation across different modalities [77, 94]. Traditional approaches in these models typically use a single visual representation for both understanding and generation tasks, regardless of whether they are based on autoregressive (AR) models [77, 85] or diffusion models [86, 94]. For example, Chameleon [77] adopts a VQ Tokenizer to encode images for both multimodal understanding and generation. However, this practice may lead to suboptimal outcomes, as the visi...

3 Janus: A Simple, Unified and Flexible Multimodal Framework

Original: Figure 2: Architecture of our Janus. Different from previous approaches [77, 85] that typically assume visual understanding and generation require the same visual encoder, our Janus decouples visual encoding for visual understanding and visual generation. “Und. Encoder” and “Gen. Encoder” are abbreviations for “Understanding Encoder” and “Generation Encoder”, respectively. Best viewed in color.

3.1 Architecture. The architecture of Janus is shown in Figure 2. For pure text understanding, multimodal understanding, and visual generation, we apply independent encoding methods to convert the ra...

Original: ...three stages, as illustrated in Figure 3. Details are provided below. Stage I: Training Adaptors and Image Head. The main goal of this stage is to create a conceptual connection between visual and linguistic elements within the embedding space, enabling the LLM to understand the entities shown in images and have preliminary visual generation ability. We keep the visual encoders and the LLM frozen during this stage, allowing only the trainable parameters within the understanding adaptor, generation adaptor and image head to be updated. Figure 3: Our Janus adopts a three-stage training p...

Original: ..., etc. (2) To handle high-resolution images, dynamic high-resolution techniques [50] can be used. This allows the model to scale to any resolution, without performing positional embedding interpolation for ViTs. Tokens can be further compressed to save computational cost, for instance, using pixel shuffle operation [12]. Visual Generation. (1) For visual generation, finer-grained encoders can be chosen in order to preserve more image details after encoding, such as MoVQGan [93]. (2) Loss functions specifically designed for visual generation can be employed, such as diffusion loss [46 ...

3.1 Architecture

Original: The architecture of Janus is shown in Figure 2. For pure text understanding, multimodal understanding, and visual generation, we apply independent encoding methods to convert the raw inputs into features, which are then processed by a unified autoregressive transformer. Specifically, for text understanding, we use the built-in tokenizer of the LLM to convert the text into discrete IDs and obtain the feature representations corresponding to each ID. For multimodal understanding, we use the SigLIP [92] encoder to extract high-dimensional semantic features from images. These features are flat...

3.2 Training Procedure

Original: The training of Janus is divided into three stages, as illustrated in Figure 3. Details are provided below. Stage I: Training Adaptors and Image Head. The main goal of this stage is to create a conceptual connection between visual and linguistic elements within the embedding space, enabling the LLM to understand the entities shown in images and have preliminary visual generation ability. We keep the visual encoders and the LLM frozen during this stage, allowing only the trainable parameters within the understanding adaptor, generation adaptor and image head to be updated. Figure 3: Our...
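
The Stage I rule quoted above (visual encoders and LLM frozen, only the two adaptors and the image head updated) can be written down as a simple trainability map. This is a minimal sketch: the module names are hypothetical labels chosen for illustration, not identifiers from the Janus code base, and only the frozen/trainable split comes from the text.

```python
# Hypothetical module names; only the frozen/trainable split is taken from the text.
MODULES = [
    "understanding_encoder",
    "generation_encoder",
    "llm",
    "understanding_adaptor",
    "generation_adaptor",
    "image_head",
]

# Stage I: only the adaptors and the image head receive gradient updates.
STAGE1_TRAINABLE = {"understanding_adaptor", "generation_adaptor", "image_head"}


def stage1_trainable_flags(modules):
    """Map each module name to True if it is updated during Stage I."""
    return {name: name in STAGE1_TRAINABLE for name in modules}


flags = stage1_trainable_flags(MODULES)
frozen = sorted(name for name, trainable in flags.items() if not trainable)
print("frozen in Stage I:", frozen)
```

In a real training script the same map would typically drive whichever mechanism the framework uses to disable gradients for the frozen modules.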

3.3 Training Objective

Original: Janus is an autoregressive model, and we simply adopt the cross-entropy loss during training: $\mathcal{L} = -\sum_{i=1} \log P_{\theta}(x_{i} \mid x_{<i})$
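
As a worked illustration of this objective, the sketch below sums the negative log-probability that a model assigns to each observed token given its prefix. The toy distributions and the function name are made up for the example; a real implementation would obtain the per-step probabilities from the transformer's softmax output.

```python
import math


def autoregressive_nll(step_probs, targets):
    """Return -sum_i log P(x_i | x_<i).

    step_probs[i] is the predicted distribution (list of probabilities)
    at position i, already conditioned on the prefix x_<i.
    targets[i] is the index of the token that actually occurred.
    """
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, targets))


# Toy three-token sequence over a two-symbol vocabulary (illustrative values).
toy_probs = [[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]
toy_targets = [0, 1, 0]
loss = autoregressive_nll(toy_probs, toy_targets)
print(round(loss, 4))
```

Because the loss is a sum of per-token terms, the same function covers text tokens and image tokens alike, which matches the text's point that a single cross-entropy objective is used.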

3.4 Inference

Original: During inference, our model adopts a next-token prediction approach. For pure text understanding and multimodal understanding, we follow the standard practice of sampling tokens sequentially from the predicted distribution. For image generation, we utilize classifier-free guidance (CFG), similar to prior works [26, 73, 8]. (Footnote 2: During training, we replace the text condition in the text-to-image data with a pad token at a probability of 10%, enabling the model to have unconditional visual generation capability.) Specifically, for each token, the logit $l_{g}$ is cal...
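
The excerpt is cut off before the guidance formula, so the sketch below uses the commonly used classifier-free guidance combination, l_u + s * (l_c - l_u), as an assumption rather than as the paper's exact expression. The scale value, the function names, and the choice to replace every condition token with a pad token are likewise illustrative; only the 10% drop probability comes from the footnote.

```python
import random


def cfg_logits(cond_logits, uncond_logits, scale):
    """Combine logits as l_u + scale * (l_c - l_u) (standard CFG form, assumed)."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def maybe_drop_condition(text_tokens, pad_token, drop_prob=0.10, rng=random):
    """Training-time helper: with probability drop_prob, hide the text condition.

    Replacing each condition token by pad_token is an assumption about how
    the pad-token substitution is applied.
    """
    if rng.random() < drop_prob:
        return [pad_token] * len(text_tokens)
    return list(text_tokens)


# scale=5.0 is an illustrative value only.
guided = cfg_logits([2.0, 0.0], [1.0, 1.0], scale=5.0)
print(guided)
```

With scale equal to 1 the guided logits reduce to the conditional logits, and larger scales push the distribution further toward tokens favored by the text condition.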

3.5 Possible Extensions

Original: It is important to note that our design, which features separate encoders for understanding and generation, is straightforward and easy to extend. Multimodal Understanding. (1) For the multimodal understanding component, a stronger vision encoder can be chosen without worrying about whether the encoder is capable of handling vision generation tasks, such as EVA-CLIP [74], InternViT [13], etc. (2) To handle high-resolution images, dynamic high-resolution techniques [50] can be used. This allows the model to scale to any resolution, without performing positional embedding interpolation f...

4 Experiments

Original: In this section, we present a series of comprehensive experiments designed to assess the performance of our method across a range of visual understanding and generation tasks. We begin by detailing our experimental setup, which includes the model architecture, training datasets, and evaluation benchmarks. Next, we report the performance of Janus, followed by a comparison with other state-of-the-art models on various benchmarks for multimodal understanding and generation. We also conduct extensive ablation studies to verify the effectiveness of the proposed method. Lastly, we provide some quali...

Original: Table 1: Detailed hyperparameters of our Janus. Data ratio refers to the ratio of multimodal understanding data, pure text data, and visual generation data.

Hyperparameters | Stage 1 | Stage 2 | Stage 3
Learning rate | 1.0 × 10^-3 | 1 × 10^-4 | 2.0 × 10^-5
LR scheduler | Cosine | Constant | Constant
Weight decay | 0.0 | 0.0 | 0.1
Gradient clip | 1.0 | 1.0 | 1.0
Optimizer | AdamW (β1 = 0.9, β2 = 0.95 ...

Original (rows of the multimodal understanding comparison table, continued): ... 8.9

LLaVA [51] | 7B | 76.3 | 809.6 | 38.7 | 33.5 | - | - | - | 25.5
LLaVA-v1.5 [50] | 7B | 85.9 | 1510.7 | 64.3 | 58.6 | 78.5 | 62.0 | 35.4 | 31.1
InstructBLIP [16] | 7B | - | - | 36.0 | 53.4 | - | 49.2 | - | 26.2
Qwen-VL-Chat [3] | 7B | - | 1487.5 | 60.6 | 58.2 | 78.2 | 57.5 | - | -
IDEFICS-9B [41] | 8B | - | - | 48.2 | - | 50.9 | 38.4 | - | -
Emu3-Chat [83] | 8B | 85...

Original: ...1k [18] for visual generation. The ShareGPT4V data is formatted as “ ”. The ImageNet data is organized into a text-to-image data format using the category names: “ ”. Here, the “<>” symbols represent placeholders. Stage II. We organize the data into the following categories. (1) Text-only data. We use pretraining text corpus from DeepSeek-LLM [5]. (2) Interleaved image-text data. We use WikiHow [39] and WIT [72] dataset. (3) Image caption data. We use images from [18, 38, 17, 40, 23, 47, 49, 45, 70]. Among them, we employ open-...

Original (rows of the GenEval comparison table, continued): ... 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.48

SDv2.1 [67] | 0.9B | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50
DALL-E 2 [66] | 6.5B | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.19 | 0.52
Emu3-Gen [83] | 8B | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54
SDXL [62] | 2.6B | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0...

Original: ...ighly competitive. For instance, Janus outperforms LLaVA-v1.5 (7B) on several datasets, including POPE, MMbench, SEED Bench, and MM-Vet. Visual Generation Performance. We report visual generation performance on GenEval, COCO-30K and MJHQ-30K benchmarks. As shown in Table 3, our Janus obtains 61% overall accuracy on GenEval, which outperforms the previous best unified model Show-o (53%) and some popular generation-only methods, e.g., SDXL (55%) and DALL-E 2 (52%). This demonstrates that our approach has better instruction-follow...

Original (rows of the FID comparison table, continued; the two score columns are COCO-30K and MJHQ-30K):

LWM [52] | 7B | 12.68 | 17.77
VILA-U (256) [85] | 7B | - | 12.81
VILA-U (384) [85] | 7B | - | 7.69
Janus (Ours) | 1.3B | 8.53 | 10.10

Table 5: Ablation studies. We verify the effectiveness of decoupling visual encoding and compare unified training with task-specific training. “Und.”, “Gen.” and “SE. Tokenizer” denote “understanding”, “generation” and “semantic tokenizer”, respectively.

Exp ID | Visual Encoder | Training Task | POPE ↑ | MMB ↑ | SEED ↑ | MMMU ↑ | COCO-FID ...

Original: (Footnote: ...n study as a stronger baseline. For simplicity, we use the ordinary VQ tokenizer [73] in the main experiment.) ...that can extract high-level semantic information from images while also having the ability to convert images into discrete IDs, which is similar to that in [85]. Details of the semantic tokenizer can be found in Appendix A.1. Impact of Decoupling Visual Encoding. (1) From the results of Exp-A, we find the model achieves satisfactory performance on the visual generation benchmark (8.72 FID on COCO). However, there is a significant gap on understanding benchmarks between...

Original: ...erstanding data. Please note that unified training and pure understanding training go through the same steps for the understanding part. Similarly, unified training and pure generation training go through the same steps for the visual generation part. Experimental results show that the performance of unified training is comparable to that of training solely for understanding or solely for visual generation. This demonstrates that our model, Janus, is capable of incorporating strong generative abilities while minimally affecting multimodal understanding performance. Figure 4: Qualitative compar...

Original: ...accurately recognizing the text in the image. Additionally, Chameleon fails to identify objects in the meme, while Show-o misinterprets the dog’s color. These examples highlight that the decoupled vision encoder significantly enhances Janus’s fine-grained multimodal understanding ability compared to the shared encoder used by Chameleon and Show-o. More multimodal understanding examples can be found in Appendix B.

4.1 Implementation Details

Original: In our experiments, we utilize DeepSeek-LLM (1.3B) [5] with a maximum supported sequence length of 4096 as the base language model. For the vision encoder used in understanding tasks, we select SigLIP-Large-Patch16-384 [92]. The generation encoder has a codebook of size 16,384 and downsamples images by a factor of 16. Both the understanding adaptor and the generation adaptor are two-layer MLPs. The detailed hyperparameters for each stage are provided in Table 1. All images are resized to 384 × 384 pixels. F...
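
The numbers in this paragraph fix how many discrete tokens one generated image occupies. The short calculation below derives that count from the stated image size and downsampling factor, and compares it with the stated maximum sequence length. The constant names are chosen for illustration, and treating the leftover positions as available for text is a simplifying assumption.

```python
IMAGE_SIZE = 384        # pixels per side (from the text)
DOWNSAMPLE = 16         # generation encoder downsampling factor (from the text)
CODEBOOK_SIZE = 16384   # number of discrete codes (from the text)
MAX_SEQ_LEN = 4096      # maximum supported sequence length (from the text)

tokens_per_side = IMAGE_SIZE // DOWNSAMPLE          # 24
tokens_per_image = tokens_per_side ** 2             # 576
bits_per_token = (CODEBOOK_SIZE - 1).bit_length()   # 14, since 2**14 == 16384
remaining_positions = MAX_SEQ_LEN - tokens_per_image

print(tokens_per_side, tokens_per_image, bits_per_token, remaining_positions)
```

Under these figures an image is a 24 × 24 grid of code indices, so one image uses 576 of the 4096 available positions.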

Original (Table 1, continued): ... 300 | 5,000 | 0
Training steps | 10,000 | 180,000 | 24,000
Batch size | 256 | 512 | 256
Data Ratio | 1:0:1 | 2:3:5 | 7:3:10

Table 2: Comparison with state-of-the-arts on multimodal understanding benchmarks. “Und.” and “Gen.” denote “understanding” and “generation”, respectively. Models using external pretrained diffusion model are marked with †.

Type | Model | # LLM Params | POPE ↑ | MME-P ↑ | MMB ↑ | SEED ↑ | VQAv2 (test) ↑ | ...

Original (Table 2 rows, continued): ...2 | 75.1 | 60.3 | 31.6 | -

InstructBLIP [16] | 13B | 78.9 | 1212.8 | - | - | - | 49.5 | - | 25.6

Und. and Gen.
DreamLLM † [21] | 7B | - | - | - | - | 72.9 | - | - | 36.6
LaVIT † [36] | 7B | - | - | - | - | 66.0 | 46.8 | - | -
Emu † [75] | 13B | - | - | - | - | 52.0 | - | - | -
NExT-GPT † [84] | 13B | - | - | - | - | 66.7 | - | - | -
Show-o [86] | 1.3B | 73.8 | 948.4 | - | - | 59.3 | 48.7 | 25.1 | -
Gemini-Nano-1 [78] | 1.8B | - | - | - | - | 62.7 | - ...
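
The data ratios in Table 1 (1:0:1, 2:3:5, and 7:3:10, ordered as multimodal understanding : pure text : visual generation per the table caption) can be read as sampling fractions. The sketch below converts them, assuming the ratio is applied to the sampled training mix; the dictionary keys and function name are illustrative.

```python
from fractions import Fraction

# Order follows the Table 1 caption: understanding, pure text, visual generation.
SOURCES = ("understanding", "text", "generation")
DATA_RATIOS = {
    "stage1": (1, 0, 1),
    "stage2": (2, 3, 5),
    "stage3": (7, 3, 10),
}


def mixing_fractions(ratio):
    """Convert an integer ratio into exact per-source sampling fractions."""
    total = sum(ratio)
    return {name: Fraction(part, total) for name, part in zip(SOURCES, ratio)}


for stage, ratio in DATA_RATIOS.items():
    fractions = mixing_fractions(ratio)
    print(stage, {name: str(frac) for name, frac in fractions.items()})
```

Read this way, visual generation data makes up half of the mix in every stage, while pure text is absent in Stage 1.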

4.2 Data Setup

Original: In this section, we provide details of the pretraining and supervised finetuning datasets. Stage I. We use a dataset that includes 1.25 million image-text paired captions from ShareGPT4V [10] for multimodal understanding and approximately 1.2 million samples from ImageNet-1k [18] for visual generation. The ShareGPT4V data is formatted as “ ”. The ImageNet data is organized into a text-to-image data format using the category names: “ ”. Here, the “<>” symbols represent placeholders. Stage II. We organize the data int...

Original: ...ne understanding, as suggested by [9]. The visual generation data is provided in the format: “ ”. Stage III. For text understanding, we use data from [43]. For multimodal understanding, we use instruct tuning data from [43, 31, 35, 69, 34, 56]. For visual generation, we use image-text pairs from [17, 70, 60] (a subset of that in stage II) and 4M in-house data. We utilize the following format for instruction tuning: “ User: \n Assistant: ”. For multi-turn dialogues, we repeat this format to structure the data.
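
The instruction-tuning template above survives only as its fixed parts ("User:", a newline, "Assistant:"); the placeholder names were lost in extraction. The sketch below fills those slots with hypothetical fields and repeats the pattern for multi-turn dialogues as the text describes. The exact spacing and the newline used to join turns are assumptions.

```python
def format_dialogue(turns):
    """Render (user_message, assistant_response) pairs in the quoted template.

    The field names and the newline between turns are assumptions; only the
    "User: ... \\n Assistant: ..." skeleton comes from the text.
    """
    rendered = [f"User: {user}\nAssistant: {assistant}" for user, assistant in turns]
    return "\n".join(rendered)


example = format_dialogue([
    ("Describe the image.", "A dog sitting on a sofa."),
    ("What color is the dog?", "Brown."),
])
print(example)
```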

4.3 Evaluation Setup

Original: Multimodal Understanding. To assess multimodal understanding capabilities, we evaluate our model on widely recognized image-based vision-language benchmarks, which include VQAv2 [31], GQA [35], POPE [48], MME [25], SEED [42], MMB [54], MM-Vet [90], and MMMU [91]. Visual Generation. For evaluating visual generation capabilities, we use the MSCOCO-30K [11], MJHQ-30K [44], and GenEval [30] benchmarks. MSCOCO-30K and MJHQ-30K employ the Fréchet Inception Distance (FID) metric on generated images compared to 30K high-quality ima...

Original (rows of the GenEval comparison table, continued): ...49 | 0.77 | 0.10 | 0.19 | 0.52

Emu3-Gen [83] | 8B | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54
SDXL [62] | 2.6B | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55

Und. and Gen.
SEED-X † [29] | 17B | 0.97 | 0.58 | 0.26 | 0.80 | 0.19 | 0.14 | 0.49
Show-o [86] | 1.3B | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0...

4.4 Comparison with State-of-the-arts

Original: Multimodal Understanding Performance. We compare the proposed method with state-of-the-art unified models and understanding-only models in Table 2. Janus achieves the overall best results among models of similar scale. Specifically, compared to the previous best unified model, Show-o [86], we achieve performance improvements of 41% (949 → 1338) and 30% (48.7 → 59.1) on the MME and GQA datasets, respectively. This can be attributed to Janus decoupling the visual encoding for multimodal understanding and genera...
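
The improvement figures quoted above can be checked with the usual relative-gain formula, (new - old) / old. The sketch below applies it to the two before-and-after pairs given in the text; the function name is illustrative. Under this definition the MME pair gives about 41%, while the GQA pair gives about 21% rather than the stated 30%, so the GQA percentage may rest on a different definition or be a typo in the source.

```python
def relative_gain_percent(old, new):
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100.0


mme_gain = relative_gain_percent(949, 1338)    # MME: 949 -> 1338
gqa_gain = relative_gain_percent(48.7, 59.1)   # GQA: 48.7 -> 59.1

print(f"MME: {mme_gain:.1f}%  GQA: {gqa_gain:.1f}%")
```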

Original (FID comparison table, continued): ...with †.

Type | Model | # Params | COCO-30K ↓ | MJHQ-30K ↓

Gen. Only
DALL·E [65] | 12B | 27.50 | -
GLIDE [59] | 5B | 12.24 | -
LDM [67] | 1.4B | 12.64 | -
DALL·E 2 [66] | 6.5B | 10.39 | -
SDv1.5 [67] | 0.9B | 9.62 | -
GigaGAN [37] | 0.9B | 9.09 | -
PixArt-α [9] | 0.6B | 7.32 | -
Imagen [68] | 34B | 7.27 | -
RAPHAEL [87] | 3B | 6.61 | -

Und. and Gen.
Emu † [75] | 13B | 11.66 | -
NExT-GPT † [84] | 13B | 11.28 ...

4.5 Ablation Studies

Original: We carefully design ablation studies to verify the effectiveness of Janus’s design concept. First, we design experiments to validate the importance and benefits of decoupling visual encoding. Second, we investigate the impact of unified training on individual tasks like multimodal understanding or visual generation. Results are listed in Table 5. Baseline Construction. Following previous work [77], we select a VQ tokenizer [73] to encode images for both multimodal understanding and generation tasks, serving as the baseline (Exp-A). Considering that the VQ tokenizer in Exp-A might be weak...

Original: To investigate whether using a single visual encoder leads to a trade-off between multimodal understanding and generation, we further design Exp-C based on Exp-B, which focuses solely on multimodal understanding training. The multimodal understanding ability of Exp-C is significantly better than that of Exp-B. This indicates that the visual encoder in Exp-B made trade-offs between multimodal understanding and generation, ultimately sacrificing its multimodal understanding capability. The above experiments illustrate the importance of decoupling visual encoding. Unified Model vs. Pure Understan...

4.6 Qualitative Results

Original: Figure 5: Qualitative results of multimodal understanding on humorous memes. We compare the response with Chameleon-7B [77] and Show-o [86]. We emphasize the key-points in the response. Best viewed on screen.

Visualizations of Visual Generation. Figure 4 provides qualitative comparisons between our model, diffusion-based models like SDXL [62], and the autoregressive model LlamaGen [73]. The results show that our model demonstrates superior instruction-following capabilities in visual generation, accurately capturing most of the details in the user’s prompt. This indicates the pot...

5 Conclusion

Original: In this paper, we introduced Janus, a simple, unified and extensible multimodal understanding and generation model. The core idea of Janus is to decouple visual encoding for multimodal understanding and generation, which could alleviate the conflict arising from the differing demands that understanding and generation place on the visual encoder. Extensive experiments have demonstrated the effectiveness and leading performance of Janus. It is also worth noting that Janus is flexible and easy to extend. In addition to having significant potential for improvement in both multimodal understanding ...

Appendix A Details of Semantic Tokenizer Mentioned in Ablation Study

Original: ...we maximize the cosine similarity between the semantic feature predicted by the semantic decoder and the SigLIP output. The weight for the semantic reconstruction loss is set to 0.25.
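The semantic reconstruction term described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the cosine-similarity objective is written as `1 - cos(pred, target)` averaged over token positions, and it takes the RGB reconstruction loss as a precomputed number because its exact form is cut off in the excerpt. The function names (`cosine_similarity`, `semantic_loss`, `total_loss`) are hypothetical; only the weight 0.25 comes from the text.

```python
import math

SEMANTIC_LOSS_WEIGHT = 0.25  # weight stated in the text


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors
    return dot / max(norm_a * norm_b, eps)


def semantic_loss(pred_features, siglip_features):
    """Assumed form: mean of (1 - cosine similarity) over token positions.

    Minimizing this value maximizes the similarity between the semantic
    decoder's predictions and the SigLIP targets.
    """
    sims = [cosine_similarity(p, t) for p, t in zip(pred_features, siglip_features)]
    return 1.0 - sum(sims) / len(sims)


def total_loss(rgb_loss, pred_features, siglip_features):
    """RGB reconstruction loss (given) plus the weighted semantic term."""
    return rgb_loss + SEMANTIC_LOSS_WEIGHT * semantic_loss(pred_features, siglip_features)
```

With this formulation, predictions that match the SigLIP features exactly add nothing to the total loss, while orthogonal predictions add the full weight of 0.25.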

A.1 Architecture of Semantic Tokenizer

Original: Figure 6: Architecture and usage of the semantic tokenizer. (a) Architecture used during training of the semantic tokenizer. We use pre-trained SigLIP [92] to supervise the reconstruction of semantic information, while using the raw image to supervise the reconstruction of RGB values. (b) Integrating LLM with the semantic decoder. The semantic decoder outputs continuous features with high-level semantics, which are passed through an adaptor and then used as input for the LLM. Please note that the semantic tokenizer is only used in the ablation study, not in the main experiment.

We build the sema...

A.2 Training

Original: Training Procedure. The semantic tokenizer is trained from scratch in a two-stage manner. In the first stage, we train the model on the ImageNet-1k [18] dataset for 40 epochs. In the second stage, we fine-tune the model for 1 epoch on 50 million images. These images come from the visual generation data used during the Janus pretraining process. We use a constant learning rate of 1e-4 and a batch size of 128.

Training Loss. The training loss of the semantic tokenizer consists of two parts. On one hand, we use the loss for RGB reconstruction as de...
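To put the two-stage schedule in concrete terms, the sketch below converts the stated epochs and batch size into optimizer steps. Several details are assumptions rather than facts from the paper: that the batch size of 128 is the global batch size and applies to both stages, that the final partial batch of each epoch is kept, and that stage one uses the standard ImageNet-1k training split of 1,281,167 images. The helper `optimizer_steps` is a hypothetical name.

```python
import math

BATCH_SIZE = 128        # stated in the text
LEARNING_RATE = 1e-4    # stated in the text; constant, so no scheduler is modeled

# Assumption: the standard ILSVRC-2012 (ImageNet-1k) training split.
IMAGENET_1K_TRAIN_IMAGES = 1_281_167
STAGE_2_IMAGES = 50_000_000  # "50 million images" from the text


def optimizer_steps(num_images, epochs, batch_size=BATCH_SIZE):
    """Steps per epoch (keeping the last partial batch) times the epoch count."""
    return math.ceil(num_images / batch_size) * epochs


stage_1_steps = optimizer_steps(IMAGENET_1K_TRAIN_IMAGES, epochs=40)
stage_2_steps = optimizer_steps(STAGE_2_IMAGES, epochs=1)
```

Under these assumptions the two stages take a similar number of steps (roughly 400k and 391k), even though stage two sees each image only once.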

A.3 Integrating with LLM

Original: We present the integration of the semantic tokenizer and the LLM in Figure 6(b). The image is first transformed into continuous features through the CNN encoder, vector quantization and the semantic decoder. Then, the LLM processes these features and generates predictions for the image IDs. Finally, the pixel decoder converts these discrete IDs into RGB values.
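The data flow described above can be sketched as a chain of stages. This is a toy illustration under stated assumptions, not the authors' code: every component (`cnn_encoder`, `semantic_decoder`, `adaptor`, `llm`, `pixel_decoder`) is a hypothetical callable passed in by the caller, the adaptor step is taken from the Figure 6 caption, and vector quantization is shown as a plain nearest-neighbor codebook lookup with squared L2 distance.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]


def run_pipeline(image, cnn_encoder, codebook, semantic_decoder, adaptor, llm, pixel_decoder):
    """Image -> continuous semantic features -> LLM -> image IDs -> RGB values."""
    features = cnn_encoder(image)                 # CNN encoder
    ids = quantize(features, codebook)            # vector quantization
    quantized = [codebook[i] for i in ids]        # look up the chosen code vectors
    semantic = semantic_decoder(quantized)        # continuous, high-level features
    llm_inputs = adaptor(semantic)                # adaptor from the Figure 6 caption
    predicted_ids = llm(llm_inputs)               # LLM predicts discrete image IDs
    return pixel_decoder(predicted_ids)           # pixel decoder maps IDs to RGB
```

Passing the stages in as callables keeps the sketch independent of any particular model library; real components would be neural networks rather than the simple functions one might plug in for a dry run.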

Appendix B Additional Qualitative Results

Original: More Visualizations of Text-to-Image Generation. We present more text-to-image generation results in Figure 7. It is evident that Janus is capable of producing high-quality images that adhere closely to the given prompts. We further explore the multilingual text-to-image capabilities of our model, as shown in Figure 8. We are pleasantly surprised to find that, despite our training data consisting solely of English text-to-image data, Janus can still process text-to-image tasks in other languages. We attribute this multilingual ability to the original large language model’s inherent traits. T...
