JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
Yiyang Ma (1,2), Xingchao Liu (1,†), Xiaokang Chen (1,†), Wen Liu (1,†), Chenyue Wu (1,3), Zhiyu Wu (1,2), Zizheng Pan (1), Zhenda Xie (1), Haowei Zhang (1), Xingkai Yu (1), Liang Zhao (1), Yisong Wang (1,4), Jiaying Liu (2), Chong Ruan (1,‡)
1 DeepSeek-AI  2 Peking University  3 The University of Hong Kong  4 Tsinghua University
† Equal contribution  ‡ Corresponding author
Project Page: https://github.com/deepseek-ai/Janus

Abstract

We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow …
1 Introduction
Figure 1: (a) Benchmark performances. (b) Visual generation results. Multimodal understanding and image generation with JanusFlow. JanusFlow surpasses state-of-the-art unified multimodal models and several task-specific understanding models on visual understanding benchmarks, and generates high-quality images at a resolution of 384×384.

Large language models (LLMs) have demonstrated remarkable capabilities in learning diverse knowledge and generalizing to new scenarios [88, 68, 1, 8, 7]. Leveraging these capabilities, researchers have developed sophisticated models specialized in image comprehension [57, 56, 47, 2, 15, 49] and text-to-image generation [77, 74, 71, 23]. The field has recently shifted toward creating unified systems capable of handling both tasks simultaneously. One prominent direction involves utilizing pre-trained text-to-image models for high-quality generation while training LLMs to generate conditions for these models [25, 26, 84, 27, 19]. However, this approach introduces architectural complexity and potentially constrains the model's capabilities by maintaining separate LLMs that only produce conditions for image generation without possessing direct generative capabilities. This separation often results in suboptimal performance compared to standalone diffusion models [25, 84]. Another line of work [85, 95, 96, 103, 93] aims to train a single LLM for both tasks. Many of these methods employ vector quantization [22, 83] to convert images into discrete tokens, enabling unified autoregressive processing [85, 93]. While straightforward to implement, these approaches are inherently limited by their image tokenization quality. Our work focuses on developing unified models …
To improve JanusFlow's performance, we implement two key strategies. First, we maintain separate vision encoders for understanding and generation tasks, preventing task interference and thus enhancing comprehension capabilities. Second, we align the intermediate representations between the generation and understanding modules during training, strengthening semantic coherence in the generation process. JanusFlow achieves state-of-the-art performance in both multimodal comprehension and text-to-image generation compared to existing unified approaches, and even outperforms several specialized methods. Specifically, on the text-to-image generation benchmarks MJHQ FID-30k [48], GenEval [28], and DPG-Bench [34], JanusFlow achieves scores of 9.51, 0.63, and 80.09%, surpassing established text-to-image models including SDv1.5 [75] and SDXL [71]. On multimodal comprehension benchmarks, JanusFlow attains scores of 74.9, 70.5, and 60.3 on MMBench [62], SeedBench [46], and GQA [35], respectively, exceeding specialized models such as LLaVA-v1.5 [56] and Qwen-VL-Chat [4]. Notably, these results are achieved with a compact …
2 Related Work
Visual Generation with Flow-based Generative Models. Recent years have witnessed remarkable progress in visual generation through diffusion models [32, 80], leading to impressive models like [75, 71, 76, 77, 74, 66]. Building on these advances, flow-based generative models [60, 55, 3] emerged as a simplified alternative framework. These approaches have recently enabled advanced visual generation models [23, 36] that achieve superior empirical performance with faster sampling. Our work demonstrates that rectified flow [60, 59, 61] can be effectively integrated into LLMs, …
Unified Multimodal Models. JanusFlow builds on flow-based generation models, leveraging their proven effectiveness in visual generation. Compared to similar approaches [96, 103], JanusFlow offers three key advantages: (i) a simple yet effective generation process using rectified flow, (ii) enhanced performance through decoupled vision encoders that resolve inter-task conflicts, and (iii) improved generation quality through representation alignment regularization, enabled by our decoupled encoder design.
3 JanusFlow
In this section, we introduce the architecture of JanusFlow and our training strategies.

3.1 Background

Multimodal LLMs. Given a dataset $\mathcal{D}$ containing discrete token sequences, each of which can be formulated as $x=(x_1,\cdots,x_\ell)$, large language models (LLMs) are trained to model the sequence distribution in an autoregressive manner,

$\log \mathrm{P}_{\theta_{LLM}}(x)=\sum_{i=0}^{\ell-1}\log \mathrm{P}_{\theta_{LLM}}(x_{i+1}\mid x_1,\ldots,x_i)$,   (1)
Rectified Flow. Rectified flow casts generation as learning an ordinary differential equation (ODE) that transports a simple distribution $\pi_0$ to the data distribution $\pi_1$ over time $t\in[0,1]$:

$\dfrac{\mathrm{d}z_t}{\mathrm{d}t}=v_{\theta_{NN}}(z_t,t),\quad z_0\sim\pi_0$,   (2)

where $\theta_{NN}$ represents the parameters of the velocity neural network and $\pi_0$ is a simple distribution, typically standard Gaussian noise $\mathcal{N}(0,I)$. The network is trained by minimizing the Euclidean distance between the neural velocity and the slopes of linear interpolation paths between samples of $\pi_0$ and $\pi_1$ (cf. Eq. (7)).
After training, samples are drawn by integrating the ODE: $z_1=\int_0^1 v_{\theta^*_{NN}}(z_t,t)\,\mathrm{d}t$, with $z_0\sim\pi_0$, follows $\pi_1$. Despite its conceptual simplicity, rectified flow has shown superior performance in various generative modeling tasks, including text-to-image generation [23], audio generation [40], and biological structure generation [38].

Figure 2: Architecture of the proposed JanusFlow. For visual understanding, the LLM performs autoregressive next-token prediction to generate responses. For image generation, the LLM employs images with …
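To make the background concrete, here is a minimal, self-contained sketch (not the paper's code) that trains a toy velocity network with the rectified flow recipe of Eq. (2) on a 1-D Gaussian target and then samples by Euler integration; the network size, data, and step counts are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

# Toy velocity network v(z_t, t); JanusFlow uses an LLM with generation
# encoder/decoder instead, but the rectified flow mechanics are the same.
net = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))
opt = torch.optim.AdamW(net.parameters(), lr=1e-3)

def velocity(z, t):
    return net(torch.cat([z, t], dim=-1))

for step in range(2000):
    x1 = torch.randn(256, 1) * 0.5 + 2.0   # "data": samples from pi_1
    z0 = torch.randn(256, 1)               # noise: samples from pi_0
    t = torch.rand(256, 1)
    zt = (1 - t) * z0 + t * x1             # point on the straight path
    # Rectified flow regresses the constant path velocity (x1 - z0).
    loss = ((velocity(zt, t) - (x1 - z0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: integrate dz/dt = v(z, t) from t=0 to t=1 with Euler steps.
with torch.no_grad():
    z = torch.randn(512, 1)
    steps = 30
    for i in range(steps):
        t = torch.full((512, 1), i / steps)
        z = z + velocity(z, t) * (1.0 / steps)
print(z.mean().item(), z.std().item())  # should approach 2.0 and 0.5
```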
3.2 A Unified Framework for Multimodal Understanding and Generation

JanusFlow presents a unified framework designed to address both vision understanding and image generation tasks. Next we outline how JanusFlow handles these two tasks within a single LLM architecture.

Multimodal Understanding. In multimodal understanding tasks, the LLM processes an input sequence consisting of interleaved text and image data. The text is tokenized into discrete tokens, each of which is transformed into an embedding of dimension $D_{emb}$. For the images, an image encoder $f_{enc}$ encodes each image $x_{im}$ into a sequence of embeddings of shape $H_{im}W_{im}\times D_{emb}$, where $H_{im}$ and $W_{im}$ are determined by the image encoder. The text and image embeddings are concatenated to form the input sequence to the LLM, which then autoregressively predicts the next tokens based on the input sequence of embeddings. Following common practice [93, 85, 96], we add a special token |BOI| before the image and |EOI| after the image to help the model locate the image embeddings in the sequence.
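As an illustration of the sequence layout just described, the following sketch assembles the LLM input under assumed shapes; `D_emb`, the 24×24 patch grid, and the randomly initialized |BOI|/|EOI| embeddings are placeholders for illustration, not the released implementation.

```python
import torch

D_emb = 2048                                  # assumed LLM embedding width
text_emb = torch.randn(1, 12, D_emb)          # embedded prompt tokens
img_feats = torch.randn(1, 24 * 24, D_emb)    # f_enc output after projection
boi = torch.randn(1, 1, D_emb)                # learned |BOI| embedding
eoi = torch.randn(1, 1, D_emb)                # learned |EOI| embedding

# Concatenate [text, |BOI|, image patches, |EOI|]; the LLM then predicts
# response tokens autoregressively over this sequence.
seq = torch.cat([text_emb, boi, img_feats, eoi], dim=1)
print(seq.shape)  # (1, 12 + 1 + 576 + 1, 2048)
```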
Image Generation. For image generation, the LLM takes a text prompt as the condition and iteratively refines a noisy latent. Starting from Gaussian noise $z_0$ of shape $H_{latent}\times W_{latent}\times D_{latent}$ in the latent space, a generation encoder $g_{enc}$ maps the current state into a sequence of embeddings of shape $H_{gen}W_{gen}\times D_{emb}$. This sequence is concatenated with a time embedding representing the current time step $t$ ($t=0$ at the beginning), resulting in a sequence of length $H_{gen}W_{gen}+1$. Unlike previous approaches that employ various attention … The LLM's output is passed to a generation decoder $g_{dec}$, producing a velocity vector of shape $H_{latent}\times W_{latent}\times D_{latent}$. The state is updated by a standard Euler solver,

$z_{t+\mathrm{d}t}=z_t+v(z_t,t)\,\mathrm{d}t$,   (4)

where $\mathrm{d}t$ is a user-defined step size. We replace $z_0$ with $z_{\mathrm{d}t}$ and repeat the process until reaching $z_1$.
To improve the alignment between generated images and text conditions, we apply classifier-free guidance (CFG) and compute the velocity as

$v(z_t,t)=w\,v(z_t,t\mid x^{con})+(1-w)\,v(z_t,t\mid\varnothing)$,   (5)

where $v(z_t,t\mid\varnothing)$ denotes the velocity inferred without text conditioning and $w\geqslant 1$ controls the magnitude of CFG. Empirically, increasing $w$ yields higher semantic alignment [75, 61, 71, 23]. Analogous to multimodal understanding, we prepend the special token |BOI| to indicate the start of image generation in the sequence.

Decoupling Encoders for the Two Tasks. Previous approaches unify both tasks in the same VAE latent space using a shared U-Net or linear encoder, while Xie et al. [96] leverage MAGVIT-v2 [98] to encode image patches into discrete tokens for both tasks. However, recent work on unified autoregressive models has shown this shared encoder design to be suboptimal [93], particularly in models that generate images through autoregression on vector-quantized tokens. Drawing on these insights, JanusFlow adopts a decoupled encoder design. Specifically, we employ a pre-trained SigLIP-Large-Patch/16 [102] model as $f_{enc}$ to extract semantic features for understanding, while the generation encoder $g_{enc}$ and decoder $g_{dec}$ are initialized randomly and trained from scratch (cf. Sec. 3.3). Our experiments show that this decoupled encoder design significantly improves the performance of our unified model. The complete architecture of JanusFlow is illustrated in Fig. 2.
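Combining Eqs. (4) and (5), one sampling loop might look like the sketch below; `llm_velocity` is a stand-in for the g_enc → LLM → g_dec pipeline described above, and the latent shape is an assumption for illustration.

```python
import torch

def llm_velocity(z_t, t, cond):
    """Placeholder for the real model: g_enc(z_t) plus a time embedding is
    fed to the LLM, and g_dec maps the output back to a velocity with the
    same shape as z_t. Here a zero stub keeps the sketch runnable."""
    return torch.zeros_like(z_t)

def sample(cond, w=2.0, steps=30, shape=(1, 4, 48, 48)):
    z = torch.randn(shape)                  # z_0 ~ N(0, I) in latent space
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_cond = llm_velocity(z, t, cond)   # v(z_t, t | x_con)
        v_unc = llm_velocity(z, t, None)    # v(z_t, t | empty prompt)
        v = w * v_cond + (1.0 - w) * v_unc  # classifier-free guidance, Eq. (5)
        z = z + v * dt                      # Euler update, Eq. (4)
    return z                                # z_1, to be decoded into an image
```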
3.3 Training Schemes

As illustrated in Fig. 3, we train our model in three sequential stages, detailed below.

Stage 1: Adaptation of Randomly Initialized Components. In the first stage, we train only the randomly initialized components: the linear layers, generation encoder, and generation decoder. This stage adapts these new modules to work effectively with the pre-trained LLM and SigLIP encoder, essentially functioning as an initialization phase for the newly introduced components.

Stage 2: Unified Pre-Training. Following the adaptation stage, we train the entire model except for the visual encoder on three types of data: multimodal understanding, image generation, and text-only data. We initially allocate a higher proportion of multimodal understanding data to establish the model's understanding capabilities. Subsequently, we increase the ratio of image generation data to accommodate the convergence requirements of diffusion-based models [18, 70].

Stage 3: Supervised Fine-Tuning (SFT). In the final stage, we fine-tune the pre-trained model using instruction tuning data, which comprises dialogues, task-specific conversations, and high-quality text-conditioned image generation examples. During this stage, we also unfreeze the SigLIP encoder.
3.4 Training Objective

Training JanusFlow involves two types of data: multimodal understanding data and image generation data. Both types consist of two parts, "condition" and "response": the condition is the prompt for the task (e.g., the text prompt in generation and the image in understanding), while the response is the corresponding output. The data can be formatted as $x=(x^{con},x^{res})$, where the superscript $con$ denotes "condition" and $res$ denotes "response". We denote the length of $x^{con}$ as $\ell_{con}$; the trainable parameters $\theta$ include the LLM, $g_{enc}$, $g_{dec}$, and the linear transformation layers.

Autoregression Objective. For multimodal understanding tasks, $x^{res}$ contains only text tokens. JanusFlow is trained using the maximum likelihood principle,

$\mathcal{L}_{AR}(\theta)=-\mathbb{E}_{x\sim\mathcal{D}_{und}}\Big[\sum_{i=\ell_{con}}^{\ell-1}\log \mathrm{P}_{\theta}(x_{i+1}\mid x_1,\ldots,x_i)\Big]$,   (6)

where $x$ is drawn from the multimodal understanding dataset $\mathcal{D}_{und}$ and the loss is computed only over the tokens in $x^{res}$.
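A sketch of Eq. (6) in PyTorch: the cross-entropy is taken only over response positions by masking the condition prefix with the standard `-100` ignore index. The shapes and demo values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ar_loss(logits, tokens, len_con):
    """logits: (B, L, V) next-token logits; tokens: (B, L) token ids;
    len_con: length of the condition prefix x^con."""
    targets = tokens[:, 1:].clone()      # predict x_{i+1} from x_1..x_i
    targets[:, : len_con - 1] = -100     # ignore loss on condition tokens
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )

logits = torch.randn(2, 16, 1000)
tokens = torch.randint(0, 1000, (2, 16))
print(ar_loss(logits, tokens, len_con=5))
```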
Rectified Flow Objective. For image generation tasks, $x^{con}$ consists of text tokens and $x^{res}$ is the corresponding image. JanusFlow is trained with the rectified flow objective,

$\mathcal{L}_{RF}(\theta)=\mathbb{E}_{x\sim\mathcal{D}_{gen},\,t\sim\mathrm{P}(t),\,z_0\sim\mathcal{N}(0,I)}\big[\|v_\theta(z_t,t\mid x^{con})-(x^{res}-z_0)\|^2\big]$,   (7)

where $z_t$ is the linear interpolation $(1-t)z_0+t\,x^{res}$ and $\mathrm{P}(t)$ is the time-sampling distribution. To enable classifier-free guidance (Eq. (5)) at inference, we randomly drop the text prompts in training.
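Eq. (7) transcribes almost directly into code. In the sketch below, `model` stands in for the conditional velocity network $v_\theta$, and P(t) is assumed uniform for illustration.

```python
import torch

def rf_loss(model, x_res, cond):
    """x_res: (B, C, H, W) clean image latents; cond: text condition."""
    b = x_res.size(0)
    z0 = torch.randn_like(x_res)                      # z_0 ~ N(0, I)
    t = torch.rand(b, 1, 1, 1, device=x_res.device)   # t ~ P(t), here uniform
    zt = (1 - t) * z0 + t * x_res                     # point on the straight path
    v_pred = model(zt, t, cond)                       # v_theta(z_t, t | x^con)
    return ((v_pred - (x_res - z0)) ** 2).mean()      # Eq. (7)

# Demo with a dummy "model" that ignores its inputs:
x = torch.randn(4, 4, 48, 48)
print(rf_loss(lambda zt, t, c: torch.zeros_like(zt), x, cond=None))
```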
Representation Alignment Regularization. Recent work [99] has shown that aligning intermediate representations between diffusion transformers and semantic vision encoders enhances diffusion model generalization. Our decoupled vision encoder design enables an efficient implementation of this alignment as a regularization term. Specifically, for generation tasks, we align features from the understanding encoder $f_{enc}$ with the LLM's intermediate features,

$\mathcal{L}_{REPA}(\theta,\varphi)=-\mathbb{E}_{x\sim\mathcal{D}_{gen}}\big[\mathrm{sim}\big(\mathrm{stop\_grad}(f_{enc}(x^{res})),\,h_\varphi(q_\theta(z_t))\big)\big]$,   (8)

where $q_\theta(z_t)$ denotes an intermediate LLM representation given input $z_t$, and $h_\varphi$ is a small trainable MLP that projects $q_\theta(z_t)$ to dimension $D_{enc}$. The function $\mathrm{sim}(\cdot,\cdot)$ computes the mean of element-wise cosine similarity between embeddings. Before computing the loss, we reshape $h_\varphi(q_\theta(z_t))$ so that its spatial layout matches the encoder features, noting that $H_{gen}=H_{im}$ and $W_{gen}=W_{im}$. The gradient of $\mathcal{L}_{REPA}$ is not back-propagated through the understanding encoder. This alignment loss helps the LLM's internal feature space (given noisy input $z_t$) align with the understanding encoder's semantic feature space, thereby improving generation quality when producing images from new random noise and text conditions during inference.

Summary. All three objectives are applied across all training stages. Multimodal understanding tasks use $\mathcal{L}_{AR}$, while image generation tasks use $\mathcal{L}_{RF}$ and $\mathcal{L}_{REPA}$.
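A sketch of the alignment term in Eq. (8): the understanding-encoder features are detached (the stop_grad), the LLM's intermediate features are projected by a small MLP $h_\varphi$, and the mean element-wise cosine similarity is maximized. The feature widths are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_llm, D_enc = 2048, 1024                    # assumed feature widths
h_phi = nn.Sequential(nn.Linear(D_llm, D_llm), nn.SiLU(), nn.Linear(D_llm, D_enc))

def repa_loss(f_enc_feats, q_theta_feats):
    """f_enc_feats: (B, N, D_enc) understanding-encoder features on x^res;
    q_theta_feats: (B, N, D_llm) intermediate LLM features on z_t."""
    target = f_enc_feats.detach()                    # stop_grad(f_enc(x^res))
    pred = h_phi(q_theta_feats)                      # h_phi(q_theta(z_t))
    cos = F.cosine_similarity(pred, target, dim=-1)  # per-token similarity
    return -cos.mean()                               # maximize mean similarity

f_feats = torch.randn(2, 576, D_enc)
q_feats = torch.randn(2, 576, D_llm)
print(repa_loss(f_feats, q_feats))
```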
In the initial 10,000 steps of Stage 2, we apply a ratio of 30:50:20 across multimodal understanding data, image generation data, and text-only data to boost the understanding ability.

Table 1: Training hyperparameters.

                 Stage 1    Stage 2    Stage 3
Learning Rate    1.0e-4     1.0e-4     2.0e-5
LR Scheduler     Constant   Constant   Constant
Weight Decay     0.0        0.0        0.0
Gradient Clip    1.0        1.0        1.0
Optimizer        AdamW (beta1 = 0.9, beta2 = 0.95) …
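For concreteness, the optimizer settings in Table 1 correspond to the following standard PyTorch configuration (a sketch; the parameter grouping and parallelism of the HAI-LLM training stack are omitted).

```python
import torch

def make_optimizer(params, stage):
    """Stagewise constant learning rates from Table 1."""
    lr = {1: 1.0e-4, 2: 1.0e-4, 3: 2.0e-5}[stage]
    return torch.optim.AdamW(params, lr=lr, betas=(0.9, 0.95), weight_decay=0.0)

model = torch.nn.Linear(8, 8)          # stand-in for the full model
opt = make_optimizer(model.parameters(), stage=2)
# Per-step gradient clipping at the Table 1 value:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```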
4 Experiments
We conduct extensive experiments to evaluate the capabilities of JanusFlow in both multimodal understanding and generation tasks. First, we describe our experimental setup and implementation details. Then, we present results on standard benchmarks for multimodal understanding and image generation. Finally, we perform ablation studies to validate our key design choices.
4.1 Experiment Setup and Implementation Details
Our framework builds upon an enhanced version of DeepSeek-LLM (1.3B) [7, 63]; this version, trained on an expanded text corpus compared to the one used in Janus [93], demonstrates better performance on multiple-choice benchmarks (e.g., MMBench [62] and SEED-Bench [46]), while our preliminary experiments suggest it has minimal impact on the quality of visual generation. The LLM consists of 24 transformer blocks and supports a sequence length of 4,096. Both understanding and generation use images at a resolution of 384×384. For multimodal understanding, we adopt SigLIP-Large-Patch/16 [102] as the understanding encoder $f_{enc}$ (Sec. 3.2) …
We use a small trainable MLP as $h_\varphi$ and employ an exponential moving average (EMA) with a ratio of 0.99 to ensure training stability. For data preprocessing, we handle understanding and generation data differently. For understanding tasks, we keep all image information by resizing the long side to the target size and padding the image to a square. For generation tasks, we resize the short side to the target size and apply random square cropping to avoid padding artifacts. During training, multiple sequences are packed to form a single sequence of length 4,096 for training efficiency. Our implementation is based on the HAI-LLM platform [31] using PyTorch [72]. Training was conducted on NVIDIA A100 GPUs, with each model requiring ~1,600 A100 GPU days.
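The two preprocessing paths described above could be implemented as in the following torchvision sketch; the bicubic interpolation and centered padding are assumptions, and the target size of 384 matches the model resolution.

```python
from PIL import Image
from torchvision import transforms
import torchvision.transforms.functional as TF

TARGET = 384

def preprocess_understanding(img: Image.Image) -> Image.Image:
    # Resize the long side to TARGET, then pad to a TARGET x TARGET square.
    w, h = img.size
    scale = TARGET / max(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    pad_w, pad_h = TARGET - img.size[0], TARGET - img.size[1]
    return TF.pad(img, [pad_w // 2, pad_h // 2,
                        pad_w - pad_w // 2, pad_h - pad_h // 2])

# Resize the short side to TARGET, then take a random TARGET x TARGET crop.
preprocess_generation = transforms.Compose([
    transforms.Resize(TARGET),        # scales the shorter side to TARGET
    transforms.RandomCrop(TARGET),
])

img = Image.new("RGB", (640, 480))
print(preprocess_understanding(img).size, preprocess_generation(img).size)
```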
4.2 Training Data Settings
We follow Janus [93] to construct the training data. The data configuration for each training stage is listed below.

Data for Stage 1 and Stage 2. The first two stages use three types of data: multimodal understanding data, image generation data, and text-only data.

1. Multimodal Understanding Data. This type of data contains several sub-categories: (a) Image caption data. We incorporate caption datasets from [20, 41, 50, 51, 53, 79] and generate additional captions for images from [16, 43] using open-source multimodal understanding models. The data follows templates …
2. Image Generation Data. We collect images from […, 43, 67, 69, 82, 21, 79] and 2 million in-house images, and enhance them with machine-generated captions using multimodal understanding models. We filter the images in [16, 79] by aspect ratio and aesthetic score, retaining approximately 20% of the original datasets. 25% of the data contains single-sentence captions, which helps the model process short prompts. All the data points are formatted with a fixed caption-image template.

3. Text-Only Data. We directly use the text corpus of DeepSeek-LLM [7].

Data for Stage 3. The SFT data comprises dialogues, task-specific conversations, and high-quality text-conditioned image generation examples (cf. Sec. 3.3), … following the instruction format "User: …\n\nAssistant: …". 3. Text-Only Data. We directly incorporate the text-only data from [47].
Table 2: Performances on GenEval benchmark. "Gen." denotes "generation" and "Unified" denotes unified understanding and generation models. Models using external pre-trained generative models are marked with †.

Type       Method            Params  Single Obj.  Two Obj.  Count.  Colors  Pos.   Color Attri.  Overall↑
Gen. Only  LlamaGen [83]     0.8B    0.71         0.34      0.21    0.58    0.07   0.04          0.32
Gen. Only  LDM [75]          1.4B    …            …         …       …       …      …             0.50
Gen. Only  DALL-E 2 [74]     6.5B    0.94         0.66      0.49    0.77    0.10   0.19          0.52
Gen. Only  Emu3-Gen [91]     8B      0.98         0.71      0.34    0.81    0.17   0.21          0.54
Gen. Only  SDXL [71]         2.6B    0.98         0.74      0.39    0.85    0.15   0.23          0.55
Gen. Only  IF-XL [17]        4.3B    0.97         0.74      0.66    0.81    0.13   0.35          0.61
Gen. Only  DALL-E 3 [6]      -       0.96         0.87      0.47    0.4…    …      …             …
Unified    …† [27]           17B     0.97         0.58      0.26    0.80    0.19   0.14          0.49
Unified    Show-o [96]       1.3B    0.95         0.52      0.49    0.82    0.11   0.28          0.53
Unified    Janus [93]        1.3B    0.97         0.68      0.30    0.84    0.46   0.42          0.61
Unified    JanusFlow (Ours)  1.3B    0.97         0.59      0.45    0.83    0.53   0.42          0.63

Table 3: Performances on DPG-Bench. The methods in this table are all generation-specific models except our method. (Columns: Method, Global, Entity, Attribute, Relation, …)
4.3 Evaluation Settings
Image Generation. We evaluate generated images using both visual quality and semantic accuracy metrics. For visual quality assessment, we employ the Fréchet Inception Distance (FID) [30] and compute FID between 30,000 generated images and their corresponding reference images from the MJHQ dataset [48]; the FID computation follows the implementation from GigaGAN [39]. To evaluate semantic accuracy, we utilize two specialized frameworks, GenEval [28] and DPG-Bench [34], which are designed to assess whether the generated images accurately contain the objects and …
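The paper follows GigaGAN's FID implementation; as an illustrative stand-in (FID values are not numerically interchangeable across implementations), the widely used clean-fid package performs the same kind of two-folder comparison. The directory paths below are hypothetical.

```python
# pip install clean-fid
from cleanfid import fid

# 30,000 generated images vs. the corresponding MJHQ reference images,
# both stored as image files in flat directories (hypothetical paths).
score = fid.compute_fid("samples/janusflow_mjhq30k", "data/mjhq30k_ref")
print(f"FID: {score:.2f}")
```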
4.4 Quantitative Results
Table 4: Results on MJHQ FID-30k. JanusFlow achieves the best FID among 1.3B models.

Method            Params  FID↓
LWM [58]          7B      17.77
VILA-U 256 [95]   7B      12.81
VILA-U 384 [95]   7B      7.69
Show-o [96]       1.3B    15.18
Janus [93]        1.3B    10.10
JanusFlow (Ours)  1.3B    9.51

Image Generation Performances. We report performances on GenEval, DPG-Bench, and MJHQ FID-30k. In Tab. 2, we give comparisons on GenEval, including the scores of all sub-tasks and the overall score. JanusFlow achieves an overall score of 0.63, … On DPG-Bench (Tab. 3), JanusFlow attains 80.09%, demonstrating the instruction-following ability of our model. We give the comparisons on MJHQ FID-30k in Tab. 4. The images sampled to calculate FID are generated with a CFG factor $w=2$ and 30 sampling steps; we sweep the CFG factor and the sampling steps and provide the results in the appendix. Our method achieves the best performance among all models with 1.3B LLMs. These results show that rectified flow improves the quality of generated images over autoregressive models such as Janus [93].
Multimodal Understanding Performances. We show comparisons with other methods on multimodal understanding benchmarks in Tab. 5.

Table 5: Comparison with other methods on multimodal understanding benchmarks. "Und." denotes "understanding" and "Unified" denotes unified understanding and generation models. Models employing external pre-trained generative models are marked with †.

Type       Model               Params  POPE↑  MME-P↑  MMB_dev↑  SEED↑  VQAv2_test↑  GQA↑   MMMU↑  MM-Vet↑
Und. Only  …                   7B      76.3   809.6   38.7      33.5   -            -      -      25.5
Und. Only  LLaVA-v1.5 [56]     7B      85.9   1510.7  64.3      58.6   78.5         62.0   35.4   31.1
Und. Only  InstructBLIP [15]   7B      -      -       36.0      53.4   -            49.2   -      26.2
Und. Only  Qwen-VL-Chat [4]    7B      -      1487.5  60.6      58.2   78.2         57.5   -      -
Und. Only  IDEFICS-9B [44]     8B      -      -       48.2      -      50.9         38.4   -      -
Und. Only  Emu3-Chat [91]      8B      85.2   -       58.5      …      …            …      …      …
Und. Only  … [13]              1.4B    84.3   1302.8  57.7      -      -            59.3   -      -
Unified    Gemini-Nano-1 [86]  1.8B    -      -       -         -      62.7         -      26.3   -
Unified    LWM [58]            7B      75.2   -       -         -      55.8         44.8   -      9.6
Unified    VILA-U [95]         7B      85.8   1401.8  -         59.0   79.4         60.8   -      33.5
Unified    Chameleon [85]      7B      -      -       -         -      -            -      22.4   8.3
Unified    DreamLLM† [19]      7B      -      -       -         -      72.9         -      -      36.6
Unified    LaVIT† [37]         7B      -      -       -         -      66.0         46.8   -      -
Unified    Emu† [84]           13B     …      …       …         …      …            …      …      …
Unified    Janus [93]          1.3B    …      …       …         …      …            …      30.5   34.3
Unified    JanusFlow (Ours)    1.3B    88.0   1333.1  74.9      70.5   79.8         60.3   29.3   30.9
4.5 Ablation Studies
Table 6: Ablation studies. The weights of the modules marked with † are frozen during training. "Exp." denotes "experiment". "FID" in this table is MJHQ FID-10k with CFG factor $w=7.5$ and 30 sampling steps. "CLIP" denotes CLIP similarity computed with the CLIP-ViT-Large-Patch/14 backbone. Exp. F is the final configuration for training JanusFlow.

Exp. ID  REPA  Und. Modules  Gen. Modules   Type     Train. Iter.  POPE↑  VQAv2_val↑  GQA↑  FID↓  CLIP↑
A        ×     SigLIP        VAE†+ConvNeXt  Unified  50,000        82.40  69.62       …     …     …
…
Impact of Representation Alignment. The results demonstrate the significant benefits of incorporating representation alignment regularization [99] during training. Specifically, models trained with representation alignment show notably lower FID scores on the MJHQ dataset and higher CLIP scores, indicating simultaneous improvements in both image quality and semantic alignment. Importantly, our architecture differs from the previous designs [70, 65] examined in [99], due to our incorporation of an LLM and an additional skip connection between $g_{enc}$ and $g_{dec}$. The effectiveness of representation alignment in our modified architecture suggests its broad applicability and generalization capability across different network structures …
Comparison with Task-Specific Models. The understanding-only and generation-only entries in Tab. 6 represent these specialized models, trained with data volumes matching the unified models. The minimal performance gap between Exp. F and these task-specific baselines demonstrates that our unified framework successfully integrates understanding and generation capabilities without significant compromise in either task's performance.

Figure 5: Visual understanding with JanusFlow. Our model effectively handles various visual understanding tasks, such as question answering, plot interpretation, and object counting.
4.6 Qualitative Results
We present qualitative evaluations of our method on both image generation and understanding tasks. Fig. 1(b) and Fig. 4 showcase the image generation capabilities of JanusFlow; the results demonstrate both the high visual quality of the generated images and the framework's ability to faithfully follow diverse instructions. For multimodal understanding, Fig. 5 presents example conversations that show our model's understanding capabilities across various scenarios. These interactions demonstrate the model's ability to understand and reason about visual content in natural language dialogues. …
5 Conclusion
We present JanusFlow, a unified framework that harmonizes autoregressive and rectified flow models for multimodal understanding and generation tasks. Our extensive experiments demonstrate that this unification achieves performance comparable to task-specific models. The successful integration of these fundamentally different model architectures not only addresses current challenges in multimodal learning but also opens new possibilities for future research on training unified models.
Appendix A Performance Analysis of 256 Resolution Model
We trained our model at two resolutions: 256×256 and 384×384. The main paper presents results from the 384×384 model as our primary results. Here, we provide a comprehensive evaluation of the 256×256 model. Its visual understanding performances are presented in Tab. 1, and its generation capabilities are evaluated using GenEval [28], DPG-Bench [34], and MJHQ FID-30k [48], with results shown in Tab. 2 and Tab. 3. (Table 1: Results on visual understanding tasks. Columns: Model, LLM Params, POPE, …)
… semantic manipulation.
Appendix B Analysis of CFG Factor and Sampling Steps
Figure 1: Results of varying CFG factors (a) and varying numbers of sampling steps (b). In (a), the number of sampling steps is set to 30; in (b), the CFG factor is set to 2.

We investigate the impact of two key generation parameters: the classifier-free guidance (CFG) factor and the number of sampling steps. While our main results use $w=2$ for CFG and 30 sampling steps to calculate FID, here we present a comprehensive analysis of these hyperparameters. Fig. 1(a) shows the effect of varying CFG factors, while …
Appendix C Additional Qualitative Results
Additional qualitative examples for understanding and generation tasks are presented in Fig. 2 and Fig. 3, respectively. The understanding examples demonstrate JanusFlow's diverse capabilities, including code generation, person identification, character recognition, and visual reasoning. For image generation, our model exhibits strong performance in both visual quality and semantic alignment with input prompts.

Figure 2: More multimodal understanding cases. Figure 3: More text-to-image generation results.
Appendix D Details of REPA Ablation
We provide the FID and CLIP similarity over the first 50,000 training iterations of the pre-training stage in Fig. 4, with and without representation alignment regularization. The gap between the two models demonstrates the benefits of using representation alignment regularization.

Figure 4: The FID and CLIP similarity during the first 50,000 iterations.