Janus is an autoregressive model, and we simply adopt the cross-entropy loss during training:
$$\mathcal{L} = -\sum_{i=1} \log P_\theta(x_i \mid x_{<i})$$
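To make the objective concrete, the following is a minimal sketch of the summed next-token cross-entropy in plain Python. The function name `cross_entropy_loss` and the list-of-logits input layout are illustrative assumptions, not part of the Janus codebase.

```python
import math


def cross_entropy_loss(logits_per_step, targets):
    """Return the sum over positions i of -log P(x_i | x_<i).

    logits_per_step[i] holds the unnormalized scores the model assigns to
    every vocabulary entry at position i (hypothetical layout), and
    targets[i] is the index of the ground-truth token x_i.
    """
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        # log P(target) = logit[target] - log Z
        total -= logits[target] - log_z
    return total
```

With uniform logits over a vocabulary of size V, each position contributes log V, which is a convenient sanity check.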
3.4. Inference
During inference, our model adopts a next-token prediction approach. For pure text understanding and multimodal understanding, we follow the standard practice of sampling tokens sequentially from the predicted distribution. For image generation, we utilize classifier-free guidance (CFG; see footnote 2), similar to prior works [8, 26, 73]. Specifically, for each token, the logit $l_g$ is calculated as $l_g = l_u + s(l_c - l_u)$, where $l_c$ is the conditional logit, $l_u$ is the unconditional logit, and $s$ is the scale for the classifier-free guidance. The default value of $s$ is 5 in the following evaluation.
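The guidance rule can be sketched as an element-wise combination of two logit vectors. The function name `cfg_logits` and the use of plain Python lists are assumptions made for illustration; only the formula and the default scale of 5 come from the text.

```python
def cfg_logits(cond_logits, uncond_logits, scale=5.0):
    """Combine logits as l_g = l_u + s * (l_c - l_u), element by element."""
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("conditional and unconditional logits must align")
    return [l_u + scale * (l_c - l_u)
            for l_c, l_u in zip(cond_logits, uncond_logits)]
```

Setting `scale=1.0` recovers the conditional logits and `scale=0.0` recovers the unconditional ones, so larger values push the prediction further in the direction the text condition favours.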
3.5. Possible Extensions
It is important to note that our design, which features separate encoders for understanding and generation, is straightforward and easy to extend.

Multimodal Understanding. (1) For the multimodal understanding component, a stronger vision encoder, such as EVA-CLIP [74] or InternViT [13], can be chosen without worrying about whether the encoder is capable of handling vision generation tasks. (2) To handle high-resolution images, dynamic high-resolution techniques [50] can be used. This allows the model to scale to any resolution without performing positional embedding interpolation for ViTs. Tokens can be further compressed to save computational cost, for instance, using a pixel shuffle operation [12].

Visual Generation. (1) For visual generation, finer-grained encoders, such as MoVQGan [93], can be chosen in order to preserve more image details after encoding. (2) Loss functions specifically designed for visual generation, such as diffusion loss [46], can be employed. (3) A combination of autoregressive (causal attention) and parallel (bidirectional attention) methods can be used in the visual generation process to reduce accumulated errors during generation [79].

Support for Additional Modalities. The straightforward architecture of Janus allows for easy integration with additional encoders, accommodating various modalities such as 3D point clouds [53], tactile [88], and EEG [4]. This gives Janus the potential to become a more powerful multimodal generalist model.
4. Experiments
In this section, we present a series of comprehensive experiments designed to assess the performance of our method across a range of visual understanding and generation tasks. We begin by detailing our experimental setup, which includes the model architecture, training datasets, and evaluation benchmarks. Next, we report the performance of Janus, followed by a comparison with other state-of-the-art models on various benchmarks for multimodal understanding and generation. We also conduct extensive ablation studies to verify the effectiveness of the proposed method. Lastly, we provide some qualitative results.

Footnote 2: During training, we replace the text condition in the text-to-image data with a pad token at a probability of 10%, giving the model the ability to perform unconditional visual generation.
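Footnote 2 describes replacing the text condition of text-to-image samples with a pad token with 10% probability during training. A minimal sketch of that step follows; the function name `drop_text_condition`, the `pad_id` argument, and the choice to overwrite every text position while keeping the sequence length are assumptions, since the text does not say how the replacement is laid out.

```python
import random


def drop_text_condition(text_tokens, pad_id, rng, p_drop=0.10):
    """With probability p_drop, hide the caption behind pad tokens.

    Assumption: every text position is overwritten with pad_id and the
    length is preserved. The returned list is always a new object.
    """
    if rng.random() < p_drop:
        return [pad_id] * len(text_tokens)
    return list(text_tokens)


rng = random.Random(0)  # seeded only to make the example repeatable
batch = [drop_text_condition([11, 12, 13], pad_id=0, rng=rng) for _ in range(4)]
```

Samples whose caption was dropped train the unconditional branch that supplies $l_u$ for classifier-free guidance at inference time.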
4.1. Implementation Details
In our experiments, we utilize DeepSeek-LLM (1.3B) [5] with a maximum supported sequence length of 4096 as the base language model. For the vision encoder used in understanding tasks, we select SigLIP-Large-Patch16-384 [92]. The generation encoder has a codebook of size 16,384 and downsamples images by a factor of 16. Both the understanding adaptor and the generation adaptor are two-layer MLPs. The detailed hyperparameters for each stage are provided in Table 1. All images are resized to 384 × 384 pixels. For multimodal understanding data, we resize the long side of the image to 384 and pad the short side with the background color (RGB: 127, 127, 127) to reach 384. For visual generation data, the short side is resized to 384, and the long side is cropped to 384. We use sequence packing during training to improve training efficiency. We mix all data types according to the specified ratios in a single training step. Our Janus is trained and evaluated using HAI-LLM [32], which is a lightweight and efficient distributed training framework built on top of PyTorch. The whole training process took 7 days on a cluster of 16 nodes, each equipped with 8 Nvidia A100 (40GB) GPUs.
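The two resizing rules differ only in which side is scaled to 384. The sketch below computes the resulting geometry without touching pixel data. All function names are hypothetical, and the even split of padding and the centred crop are assumptions: the text specifies the target size, the pad colour, and which side is padded or cropped, but not the alignment.

```python
def understanding_layout(width, height, target=384):
    """Resize the long side to `target`, then pad the short side."""
    scale = target / max(width, height)
    new_w = min(target, max(1, round(width * scale)))
    new_h = min(target, max(1, round(height * scale)))
    pad_x, pad_y = target - new_w, target - new_h
    # Assumption: padding (RGB 127, 127, 127) is split evenly on both sides.
    return {
        "resized": (new_w, new_h),
        "pad_left": pad_x // 2, "pad_right": pad_x - pad_x // 2,
        "pad_top": pad_y // 2, "pad_bottom": pad_y - pad_y // 2,
    }


def generation_layout(width, height, target=384):
    """Resize the short side to `target`, then crop the long side."""
    scale = target / min(width, height)
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    # Assumption: the crop window is centred.
    left, top = (new_w - target) // 2, (new_h - target) // 2
    return {"resized": (new_w, new_h),
            "crop_box": (left, top, left + target, top + target)}


def generation_token_count(target=384, downsample=16):
    """Discrete tokens per image for a 16x-downsampling encoder."""
    side = target // downsample
    return side * side
```

For a 768 × 384 input, the understanding path keeps the whole image at 384 × 192 and pads vertically, while the generation path keeps full height and crops a 384-wide window. At 384 × 384 with 16× downsampling, each image becomes a 24 × 24 grid of 576 codebook indices.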
4.2. Data Setup
In this section, we provide details of the pretraining and supervised finetuning datasets.

Stage I. We use a dataset that includes 1.25 million image-text paired captions from ShareGPT4V [10] for multimodal understanding and approximately 1.2 million samples from ImageNet-1k [18] for visual generation. The ShareGPT4V data is formatted as “”. The ImageNet data is organized into a text-to-image data format using the category names: “”. Here, the “<>” symbols represent placeholders.

Stage II. We organize the data into the following categories. (1) Text-only data. We use the pretraining text corpus from DeepSeek-LLM [5]. (2) Interleaved image-text data. We use the WikiHow [39] and WIT [72] datasets. (3) Image caption data. We use images from [17, 18, 23, 38, 40, 45, 47, 49, 70]. Among them, we employ open-source multimodal models to re-caption the images in [17, 40]. The image caption data is formatted into question-answer pairs, for example, “Describe the image in detail.”. (4) Table and chart data. We use the corresponding table and chart data from DeepSeek-VL [55]. The data is formatted as “”. (5) Visual generation data. We utilize image-caption pairs from various datasets, including [17, 38, 40, 57, 58, 60, 63, 70], along with 2 million in-house samples. For images from [38, 70], we filter based on aesthetic scores and image sizes, retaining 20% of them. During training, we randomly use only the first sentence of a caption with a probability of 25%, to encourage the model to develop strong generation capabilities for short descriptions. ImageNet samples [18] are presented only during the first 120K training steps, while images from other datasets appear in the subsequent 60K steps. As suggested by [9], this approach helps the model first learn basic pixel dependencies before progressing to more complex scene understanding. The visual generation data is provided in the format: “”.

Stage III. For text understanding, we use data from [43]. For multimodal understanding, we use instruction tuning data from [31, 34, 35, 43, 56, 69]. For visual generation, we use image-text pairs from [17, 60, 70] (a subset of the Stage II data) and 4 million in-house samples. We utilize the following format for instruction tuning: “User: \n Assistant: ”. For multi-turn dialogues, we repeat this format to structure the data.
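Two of the Stage II rules for visual generation data are easy to state in code: the 25% chance of keeping only the first sentence of a caption, and the switch from ImageNet to the other datasets after 120K steps. The sketch below is illustrative only. The function names are hypothetical, and splitting at the first period is a naive assumption about what counts as the "first sentence".

```python
import random


def shorten_caption(caption, rng, p_first_sentence=0.25):
    """With probability p_first_sentence, keep only the first sentence.

    Assumption: a sentence ends at the first '.'; captions without a
    period are returned unchanged.
    """
    if rng.random() < p_first_sentence:
        first, sep, _ = caption.partition(".")
        return (first + sep).strip()
    return caption


def generation_source(step, imagenet_steps=120_000, other_steps=60_000):
    """Name the visual generation data pool used at a Stage II step."""
    if step < 0 or step >= imagenet_steps + other_steps:
        raise ValueError("step is outside the 180K-step schedule")
    return "imagenet" if step < imagenet_steps else "other"
```

The schedule mirrors the curriculum described above: category-name prompts first, so the model learns basic pixel dependencies, then richer captions.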
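Table 2 (excerpt) | Understanding-only ("Und. Only") models. A dash means the score is not reported.

| Model | # LLM Params | POPE | MME | MMB | SEED | VQAv2 | GQA | MMMU | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-v1.5-Phi-1.5 [86] | 1.3B | 84.1 | 1128.0 | - | - | 75.3 | 56.5 | 30.7 | - |
| MobileVLM [14] | 1.4B | 84.5 | 1196.2 | 53.2 | - | - | 56.1 | - | - |
| MobileVLM-V2 [15] | 1.4B | 84.3 | 1302.8 | 57.7 | - | - | 59.3 | - | - |
| MobileVLM [14] | 2.7B | 84.9 | 1288.9 | 59.6 | - | - | 59.0 | - | - |
| MobileVLM-V2 [15] | 2.7B | 84.7 | 1440.5 | 63.2 | - | - | 61.1 | - | - |
| LLaVA-Phi [96] | 2.7B | 85.0 | 1335.1 | 59.8 | - | 71.4 | - | - | 28.9 |
| LLaVA [51] | 7B | 76.3 | 809.6 | 38.7 | 33.5 | - | - | - | 25.5 |
| LLaVA-v1.5 [50] | 7B | 85.9 | 1510.7 | 64.3 | 58.6 | 78.5 | 62.0 | 35.4 | 31.1 |
| InstructBLIP [16] | 7B | - | - | 36.0 | 53.4 | - | 49.2 | - | 26.2 |
| Qwen-VL-Chat [3] | 7B | - | 1487.5 | 60.6 | 58.2 | 78.2 | 57.5 | - | - |
| IDEFICS-9B [41] | 8B | - | - | 48.2 | - | 50.9 | 38.4 | - | - |
| Emu3-Chat [83] | 8B | 85.... | | | | | | | |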
4.3. Evaluation Setup
Multimodal Understanding. To assess multimodal understanding capabilities, we evaluate our model on widely recognized image-based vision-language benchmarks, which include VQAv2 [31], GQA [35], POPE [48], MME [25], SEED [42], MMB [54], MM-Vet [90], and MMMU [91].

Visual Generation. For evaluating visual generation capabilities, we use the MSCOCO-30K [11], MJHQ-30K [44], and GenEval [30] benchmarks. MSCOCO-30K and MJHQ-30K compute the Fréchet Inception Distance (FID) between generated images and 30K high-quality images, which indicates the overall efficacy of image generation. GenEval is a challenging benchmark for image generation, designed to reflect the comprehensive generative abilities of visual generation models by offering a detailed instance-level analysis of their compositional capabilities.
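For reference, FID is conventionally defined from the mean and covariance of Inception features of the two image sets. The symbols below ($\mu_r, \Sigma_r$ for the reference images and $\mu_g, \Sigma_g$ for the generated images) are notation introduced here for illustration; the text above does not spell the formula out.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values mean the two feature distributions are closer, which is why smaller FID scores are read as better generation quality.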
4.4. Comparison with State-of-the-arts
Multimodal Understanding Performance. We compare the proposed method with state-of-the-art unified models and understanding-only models in Table 2. Janus achieves the overall best results among models of similar scale. Specifically, compared to the previous best unified model, Show-o [86], we achieve relative improvements of 41% (949 → 1338) and about 21% (48.7 → 59.1) on the MME and GQA datasets, respectively. This can be attributed to Janus decoupling the visual encoding for multimodal understanding and generation, mitigating the conflict between these two tasks. When compared to models with significantly larger sizes, Janus remains highly competitive. For instance, Janus outperforms LLaVA-v1.5 (7B) on several datasets, including POPE, MMBench, SEED-Bench, and MM-Vet.

Visual Generation Performance. We report visual generation performance on the GenEval, COCO-30K and MJHQ-30K benchmarks. As shown in Table 3, our Janus obtains 61% overall accuracy on GenEval, which outperforms the previous best unified model Show-o (53%) and some popular generation-only methods, e.g., SDXL (55%) and DALL-E 2 (52%). This demonstrates that our approach has better instruction-following capabilities. As shown in Table 4, Janus achieves FIDs of 8.53 and 10.10 on the COCO-30K and MJHQ-30K benchmarks, respectively, surpassing the unified models Show-o and LWM and performing competitively with some well-known generation-only methods. This demonstrates that the images generated by Janus have good quality and highlights its potential in visual generation.
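The quoted score pairs can be checked with a one-line relative-gain calculation, (new − old) / old. The helper name `relative_gain` is an assumption introduced for this worked example. Under this definition the MME pair (949 → 1338) gives about 41% and the GQA pair (48.7 → 59.1) gives about 21%.

```python
def relative_gain(old, new):
    """Relative improvement of `new` over `old`, as a fraction of `old`."""
    return (new - old) / old


mme_gain = relative_gain(949, 1338)    # MME: Show-o -> Janus
gqa_gain = relative_gain(48.7, 59.1)   # GQA: Show-o -> Janus
print(f"MME: {mme_gain:.1%}, GQA: {gqa_gain:.1%}")
```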
4.5. Ablation Studies
We carefully design ablation studies to verify the effectiveness of Janus’s design concept. First, we design experiments to validate the importance and benefits of decoupling visual encoding. Second, we investigate the impact of unified training on individual tasks like multimodal understanding or visual generation. Results are listed in Table 5.

Table 4 | Evaluation of text-to-image generation ability on MSCOCO-30K and MJHQ-30K benchmark. “Und.” and “Gen.” denote “understanding” and “generation”, respectively. Models using external pretrained diffusion model are marked with †.
Type Model #...

Baseline Construction. Following previous work [77], we select a VQ tokenizer [73] to encode images for both multimodal understanding and generation tasks, serving as the baseline (Exp-A). Considering that the VQ tokenizer in Exp-A might be weak in extracting semantic information, making it less effective for multimodal understanding, we also construct a stronger baseline, Exp-B. We adopt SigLIP to distill an enhanced semantic tokenizer (see footnote 3) that can extract high-level semantic information from images while also being able to convert images into discrete IDs, similar to the approach in [85]. Details of the semantic tokenizer can be found in Appendix A.1.

Footnote 3: The semantic tokenizer is only used in the ablation study as a stronger baseline. For simplicity, we use the ordinary VQ tokenizer [73] in the main experiment.

Impact of Decoupling Visual Encoding. (1) From the results of Exp-A, we find that the model achieves satisfactory performance on the visual generation benchmark (8.72 FID on COCO). However, there is a significant gap on understanding benchmarks between Exp-A and our model (Exp-D). (2) When comparing Exp-B to Exp-A, the results show a clear improvement in multimodal understanding, but there is still a considerable gap compared to our method. In terms of visual generation, Exp-B outperforms Exp-D. We hypothesize two possible reasons for this.
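[Figure 4: qualitative text-to-image comparison. Columns: Janus (Ours), LlamaGen, SDXL. Prompts follow.]
A close-up high-contrast photo of Sydney Opera House sitting next to the Eiffel Tower, under a blue night sky of roiling energy, exploding yellow stars, and radiating swirls of blue.
A detailed portrait of Janus, the two-faced Roman god, showing his two faces looking in opposite directions. One face appears aged, with deep-set wrinkles and a wise, contemplative expression, while the other is youthful, radiating vigor and curiosity. His hair is styled in flowing curls that frame both faces with divine symmetry. The colors are richly contrasted: the left side is dominated by cool blues and silvers, symbolizing winter and reflection, while the right side is bathed in warm golds and reds, representing spring and renewal. The background is a celestial tapestry, dotted with stars and motifs that symbolize time and its passage.
A wise old owl with golden feathers perched on a glowing crystal tree in a magical forest. Glowing fireflies hover and dance, and an ethereal mist drifts through the woods, shimmering under swirls of iridescent moonlight and sparkling emerald leaves.
A brave puppy in a futuristic space suit exploring an alien planet crisscrossed by stardust dunes and meteor showers. The landscape is dotted with glowing crystal formations and ethereal landforms, creating a surreal environment, while swirling vortexes in the sky depict the endless dance of distant galaxies.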
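First, the semantic tokenizer produces discrete IDs that are more semantically coherent, providing more reasonable prediction targets for the LLM. Second, the visual encoder in Exp-B has significantly more parameters than the Gen. encoder in Exp-D. (3) To investigate whether using a single visual encoder leads to a trade-off between multimodal understanding and generation, we further design Exp-C based on Exp-B, which focuses solely on multimodal understanding training. The multimodal understanding ability of Exp-C is significantly better than that of Exp-B. This indicates that the visual encoder in Exp-B made trade-offs between multimodal understanding and generation, ultimately at the cost of its multimodal understanding capability. The above experiments illustrate the importance of decoupling visual encoding.

Unified Model vs. Pure Understanding and Pure Generation. We compare the performance of unified training (Exp-D) against pure understanding (Exp-E) and pure generation (Exp-F) training. For pure understanding, we omit the visual generation data. For pure generation, we exclude the understanding data. Note that unified training and pure understanding training go through the same training steps for the understanding part. Similarly, unified training and pure generation training go through the same training steps for the visual generation part. Experimental results show that the performance of unified training is comparable to that of training solely for understanding or solely for visual generation. This demonstrates that our model, Janus, is capable of incorporating strong generative abilities while minimally affecting multimodal understanding performance.

User: From top to bottom, what is the meaning of this meme?
Janus (Ours): The meme is a humorous take on the concept of the Nobel Prize in Physics. The top half shows a man in a red jacket with his hand raised, looking confused or surprised; the bottom half shows the same man smiling, appearing very satisfied or proud, as if he had just won the Nobel Prize. The meme plays on this contrast: the man in the top half may be confused or surprised by the idea of winning the Nobel Prize in Physics, while the man in the bottom half shows genuine joy about it.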
Chameleon: The meme is a visual representation of a conversation between two people, with the top panel featuring the text "I'm not arguing, I'm just explaining why I'm right" and the bottom panel featuring the text "No, you're not." The meme is using a technique called "irony" to convey the idea that even though one person is presenting themselves as non-argumentative, they are actually being argumentative. The use of the phrase "I'm not arguing, I'm just explaining why I'm right" is a common tactic used by people who are trying to avoid an argument but are actually engaging in one. The meme pokes fun at this tactic by showing that the person's words and actions directly contradict what they claim.

Show-o: The meaning of the meme from the top panel to the bottom panel is that it depicts, in a humorous and lighthearted way, a man wearing a yellow hat and a red jacket, with the caption "I'm not a fan of the sun". This suggests that the man does not like the sun and is using it as a metaphor for his own feelings or emotions. The meme is likely intended to be humorous and entertaining, as it combines a serious subject with a playful presentation.

User: Explain this meme.
Janus (Ours): The meme compares two different types of neural networks used in machine learning. The first is called "Decoupling Visual Encoding" and the second is called "Single Visual Encoder". The idea the meme tries to convey is that the first type is more powerful and can handle more complex tasks, while the second type is simpler and less capable.

Chameleon: The image shows a cartoon of a pig and a bear standing side by side, with the text "When you're a vegan and your friends still order sausage and bacon" above them. The image humorously portrays the challenges vegans face in non-vegan settings, where friends or family may not share their dietary preferences.

Show-o: The image shows two dogs standing side by side, one brown and one white. The brown dog is on the left side of the image and the white dog is on the right. Both dogs are looking at the camera and appear to be posing for a photo. The brown dog is slightly larger than the white dog. The image is captioned "I'm not a dog, but I'm a dog." This humorous statement echoes the image of the two dogs, reinforcing the contradiction and the joke.

We compare the response with Chameleon-7B [77] and Show-o [86]. We emphasize the key-points in the response. Best viewed on screen.
4.6. Qualitative Results
Visualizations of Visual Generation. Figure 4 provides qualitative comparisons between our model, diffusion-based models like SDXL [62], and the autoregressive model LlamaGen [73]. The results show that our model demonstrates superior instruction-following capabilities in visual generation, accurately capturing most of the details in the user’s prompt. This indicates the potential of the unified model in the realm of visual generation. More visualizations can be found in Appendix B.

Multimodal Understanding on MEME Images. Figure 5 showcases the qualitative results of Janus’s multimodal understanding ability, compared with Chameleon [77] and Show-o [86]. Janus accurately interprets the text caption and captures the emotion conveyed in the meme. In contrast, both Chameleon and Show-o struggle with accurately recognizing the text in the image. Additionally, Chameleon fails to identify objects in the meme, while Show-o misinterprets the dog’s color. These examples highlight that the decoupled vision encoder significantly enhances Janus’s fine-grained multimodal understanding ability compared to the shared encoder used by Chameleon and Show-o. More multimodal understanding examples can be found in ...
5. Conclusion
In this paper, we introduced Janus, a simple, unified and extensible multimodal understanding and generation model. The core idea of Janus is to decouple visual encoding for multimodal understanding and generation, which could alleviate the conflict arising from the differing demands that understanding and generation place on the visual encoder. Extensive experiments have demonstrated the effectiveness and leading performance of Janus. It is also worth noting that Janus is flexible and easy to extend. In addition to having significant potential for improvement in both multimodal understanding and generation, Janus is also easily extendable to incorporate more input modalities. The above advantages suggest that Janus may serve as an inspiration for the development of the next generation of multimodal general-purpose models.
Appendix
A. Details of Semantic Tokenizer Mentioned in Ablation Study
A.1. Architecture of Semantic Tokenizer
[Figure 6: (a) Architecture of Semantic Tokenizer. (b) Architecture of LLM with Semantic Tokenizer I... Panel labels: CNN Encoder, Vector Quantization, Lookup Codebook, Discrete Visual Tokens (after VQ), Semantic Decoder, Pixel Decoder, Pretrained SigLIP, Semantic Reconstruction Loss, RGB Reconstruction Loss, Adaptor, LLM, Discrete IDs from LLM Prediction.]
A.2. Training
Training Procedure. The semantic tokenizer is trained from scratch in a two-stage manner. In the first stage, we train the model on the ImageNet-1k [18] dataset for 40 epochs. In the second stage, we fine-tune the model for 1 epoch on 50 million images. These images come from the visual generation data used during the Janus pretraining process. We use a constant learning rate of 1e-4 and a batch size of 128.
Training Loss. The training loss of the semantic tokenizer consists of two parts. On one hand, we use the RGB reconstruction loss as described in [73]. On the other hand, we use SigLIP-Large-Patch16-384 as the teacher to supervise the semantic feature reconstruction results of the semantic decoder. We adopt the loss from BEiT-v2 [61].
Specifically, we maximize the cosine similarity between the semantic feature predicted by the semantic decoder and the SigLIP output. The weight for the semantic reconstruction loss is set to 0.25.
A.3. Integrating with LLM
We present the integration of the semantic tokenizer and the LLM in Figure 6 (b). The image is first transformed into continuous features through the CNN encoder, vector quantization and the semantic decoder. Then, the LLM processes these features and generates predictions for the image IDs. Finally, the pixel decoder converts these discrete IDs into RGB values.
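Returning to the training loss of A.2, the weighted combination of the two terms can be sketched as follows. This is a simplified illustration, not the authors' implementation. It assumes the semantic term is written as 1 minus the cosine similarity, which is one common way to turn "maximize cosine similarity" into a loss, and it treats the RGB reconstruction loss as a precomputed scalar. The function names are hypothetical; only the 0.25 weight comes from the text.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def tokenizer_loss(rgb_loss, predicted_feature, siglip_feature,
                   semantic_weight=0.25):
    """Total loss = RGB reconstruction + weight * semantic reconstruction.

    Assumption: semantic reconstruction loss = 1 - cos(pred, teacher),
    so perfectly aligned features contribute zero.
    """
    semantic_loss = 1.0 - cosine_similarity(predicted_feature, siglip_feature)
    return rgb_loss + semantic_weight * semantic_loss
```

In practice the semantic term would be averaged over all patch features of an image. The single-vector form here only shows how the 0.25 weight enters.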
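B. Additional Qualitative Results
More Visualizations of Text-to-Image Generation. Figure 7 presents more text-to-image generation results. It is evident that Janus is capable of producing high-quality images that adhere closely to the given prompts. We further explore the multilingual text-to-image capabilities of our model, as shown in Figure 8. We are pleasantly surprised to find that, although our training data consists only of English text-to-image samples, Janus can still handle text-to-image tasks in other languages. We attribute this multilingual ability to the inherent properties of the base large language model: the LLM first maps different languages into a unified semantic space, which allows Janus to perform text-to-image generation naturally without additional training.
More Multimodal Understanding Results. Figure 9 presents additional multimodal understanding results. Janus exhibits impressive comprehension abilities when handling inputs from various contexts, showcasing its powerful capabilities.

[Figure 7 prompts:]
a young woman, looks like mix of Lana Del Rey and grimes, flowing cool colored hair, marbled, iridescent, shoujo manga, pre-raphaelite, k-pop, gilded, pearl, spun silk, clouds, ghost, glowing jellyfish, billowing gossamer cloth, Alexander McQueen, handmade lace, floral embroidery, snakeskin, dramatic lighting
Tiny cute adorable mouse dressed as a king in a castle, anthropomorphic, Jean-Baptiste Monge, soft cinematic lighting, 8k, intricate details, portrait, Pixar style character, old-fashioned movie style
Portrait of a beautiful, curvaceous pirate princess goddess with red hair, intricate ornate clothing, Caribbean background + outdoors + sea, by ArtGerm, Alphonse Mucha, Roberto Ferri, Ross Tran, Pixar style, low-angle shot, digital painting, cinematic rim lighting, Unreal Engine 5, 8K
A cute fluffy fat marmot sunbathing on a pile of rocks, snowy mountains in the background, a turquoise glacial lake in the distance, clear blue sky, highly detailed, golden hour, natural light, Octane render, Unreal Engine
Epic 3D portrait of a white King Kong wearing mechanical armor made of black crystal with golden patterns around it, symmetrical body, hyper-realistic, intricate details, glossy, cinematic, Unreal Engine, ArtStation style, Octane render
The ultimate wristwatch time machine, ultra-advanced technology, holographic display, intricate mechanical structure
A stunning princess from Kabul in red and white traditional clothing, blue eyes, brown hair
Tiny cute adorable fluffy baby raccoon wearing a knitted blue scarf, leaning against a table in a medieval tavern, holding a coffee cup, anthropomorphic, Jean-Baptiste Monge, soft cinematic lighting, 8k, intricate details, portrait, Pixar style character, old-fashioned movie style
A cyborg-enhanced panda, more cyborg elements, 3D, 4K, Unreal Engine, Chaos 20
An architectural parametric pavilion made of wood and glass, with organic cavities, surrounded by a beautiful forest. Dramatic scene, photorealistic, hyper-realistic, ray-traced reflections, 8K HD, intricate details, Frank Lloyd Wright style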
A realistic photo of a steaming cup of coffee and a brass vase holding a large bouquet of spring flowers, beside an old oak window at sunrise, finely detailed, rich colors, shot with a Nikon Z6 camera and a Nikon Nikkor lens, 50mm f/5.6, ISO 100, shutter speed 1/400 s
UHD dtm HDR 8k Beautiful surreal symbolism the mesmerizing vision of a Cleopatra Queen of Egypt, full body, mesmerizing brown eyes, black hair and ethereal features, radiating celestial aura, super high definition, true lifelike color, perfect exposure, razor sharp focus, golden ratio, soft reflections, bokeh effect, fine art photography, cinematic compositing, authentic, professional by Rorianai style 36k s1000
Figure 7 | More text-to-image generation results. We upsample the images to 1024 × 1024 for better visualization.
[Figure 8: multilingual text-to-image generation. Columns: LlamaGen, Janus (Ours). Prompts follow.]
A realistic photo of snowy mountains with sunlit peaks sparkling under a clear blue sky.
这是一张雪山写实照片,阳光照耀下的山峰在晴朗的蓝天下熠熠生辉。前景中高大的常青树被白雪覆盖,为这令人惊叹的高山景色增添了深度。
Une photo réaliste de montagnes enneigées dont les sommets éclairés par le soleil scintillent sous un ciel bleu limpide. Au premier plan, de grands arbres à feuilles persistantes saupoudrés de neige ajoutent de la profondeur à cette scène alpine à couper le souffle.
A realistic photo of ❄⛰ with ☀ under a clear blue sky. 🌲 dusted with snow in the foreground add depth to this stunning alpine scene.
澄み切った青空の下、太陽に照らされた山頂が輝く雪山のリアルな写真。前景の雪に覆われた背の高い常緑樹が、息をのむようなアルプスの風景に深みを与えている。
The chart is divided into four segments, each representing a different type of animal and its corresponding percentage. Here's the code to generate the pie chart:
```python
import matplotlib.pyplot as plt
# Data
animals = [
{"name": "Hogs", "frequency": 30.0},
{"name": "Frog", "frequency": 15.0},
{"name": "Logs", "frequency": 10.0},
{"name": "Dogs", "frequency": 45.0}
]
# Create the pie chart
plt.pie(animals, labels=animals, autopct='%1.1f%%',
startangle=90)
# Set the title and labels
plt.title('Distribution of Animals')
plt.title.set_text('Frequency of Animals')
# Show the plot
plt.show()
```
This image is an interesting piece of art because it combines a famous painting, the Mona Lisa, with a dog. The dog is presented in the same pose as the Mona Lisa, but with a different expression and a different background. The painting blends a distinctive artistic style and subject matter, which makes it particularly eye-catching.
References
[1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Anthropic. The claude 3 model family: Opus, sonnet, haiku. https://www.anthropic.com, 2024.
[3] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
[4] Y. Bai, X. Wang, Y.-p. Cao, Y. Ge, C. Yuan, and Y. Shan. Dreamdiffusion: Generating high-quality images from brain eeg signals. arXiv preprint arXiv:2306.16934, 2023.
[5] X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
[6] T. B. Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[7] H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022.
[8] H. Chang, H. Zhang, J. Barber, A. Maschinot, J. Lezama, L. Jiang, M.-H. Yang, K. Murphy, W. T. Freeman, M. Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
[9] J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, P. Luo, H. Lu, et al. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.
[10] L. Chen, J. Li, X. Dong, P. Zhang, C. He, J. Wang, F. Zhao, and D. Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.
[11] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[12] Z. Chen, W. Wang, H. Tian, S. Ye, Z. Gao, E. Cui, W. Tong, K. Hu, J. Luo, Z. Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024.
[13] Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
[14] X. Chu, L. Qiao, X. Lin, S. Xu, Y. Yang, Y. Hu, F. Wei, X. Zhang, B. Zhang, X. Wei, et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.
[15] X. Chu, L. Qiao, X. Zhang, S. Xu, F. Wei, Y. Yang, X. Sun, Y. Hu, X. Lin, B. Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.
[16] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
[17] dclure. Laion-aesthetics-umap. https://huggingface.co/datasets/dclure/laion-aesthetics-12m-umap, 2022.
[18] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[19] J. Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[20] P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
[21] R. Dong, C. Han, Y. Peng, Z. Qi, Z. Ge, J. Yang, L. Zhao, J. Sun, H. Zhou, H. Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023.
[22] A. Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[23] Echo840. Detailed caption dataset. https://huggingface.co/datasets/echo840/Detailed_Caption, 2023.
[24] P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.
[25] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
[26] O. Gafni, A. Polyak, O. Ashual, S. Sheynin, D. Parikh, and Y. Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision, pages 89–106. Springer, 2022.
[27] Y. Ge, Y. Ge, Z. Zeng, X. Wang, and Y. Shan. Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041, 2023.
[28] Y. Ge, S. Zhao, Z. Zeng, Y. Ge, C. Li, X. Wang, and Y. Shan. Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218, 2023.
[29] Y. Ge, S. Zhao, J. Zhu, Y. Ge, K. Yi, L. Song, C. Li, X. Ding, and Y. Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024.
[30] D. Ghosh, H. Hajishirzi, and L. Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36, 2024.
[31] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017.
[32] High-flyer. Hai-llm: Efficient and lightweight training tool for large models, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm.
[33] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
[34] Y.-C. Hsiao, F. Zubach, M. Wang, et al. Screenqa: Large-scale question-answer pairs over mobile app screenshots. arXiv preprint arXiv:2209.08199, 2022.
[35] D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
[36] Y. Jin, K. Xu, L. Chen, C. Liao, J. Tan, B. Chen, C. Lei, A. Liu, C. Song, X. Lei, et al. Unified language-vision pretraining with dynamic discrete visual tokenization. arXiv preprint arXiv:2309.04669, 2023.
[37] M. Kang, J.-Y. Zhu, R. Zhang, J. Park, E. Shechtman, S. Paris, and T. Park. Scaling up gans for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10124–10134, 2023.
[38] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
[39] M. Koupaee and W. Y. Wang. Wikihow: A large scale text summarization dataset. arXiv preprint arXiv:1810.09305, 2018.
[40] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, T. Duerig, and V. Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
[41] H. Laurençon, D. van Strien, S. Bekman, L. Tronchon, L. Saulnier, T. Wang, S. Karamcheti, A. Singh, G. Pistilli, Y. Jernite, et al. Introducing idefics: An open reproduction of state-of-the-art visual language model, 2023. URL https://huggingface.co/blog/idefics.
[42] B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
[43] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, Y. Li, Z. Liu, and C. Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
[44] D. Li, A. Kamko, E. Akhgari, A. Sabet, L. Xu, and S. Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024.
[45] L. Li, Y. Wang, R. Xu, P. Wang, X. Feng, L. Kong, and Q. Liu. Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models. arXiv preprint arXiv:2403.00231, 2024.
[46] T. Li, Y. Tian, H. Li, M. Deng, and K. He. Autoregressive image generation without vector quantization. arXiv preprint arXiv:2406.11838, 2024.
[47] X. Li, F. Zhang, H. Diao, Y. Wang, X. Wang, and L.-Y. Duan. Densefusion-1m: Merging vision experts for comprehensive multimodal perception. arXiv preprint arXiv:2407.08303, 2024.
[48] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
[49] Z. Li, X. Yang, K. Choi, W. Zhu, R. Hsieh, H. Kim, J. H. Lim, S. Ji, B. Lee, X. Yan, et al. Mmsci: A multimodal multi-discipline dataset for phd-level scientific comprehension. arXiv preprint arXiv:2407.04903, 2024.
[50] H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
[51] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
[52] H. Liu, W. Yan, M. Zaharia, and P. Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024.
[53] M. Liu, R. Shi, K. Kuang, Y. Zhu, X. Li, S. Han, H. Cai, F. Porikli, and H. Su. Openshape: Scaling up 3d shape representation towards open-world understanding. Advances in neural information processing systems, 36, 2024.
[54] Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, et al. Mm-bench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
[55] H. Lu, W. Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T. Ren, Z. Li, Y. Sun, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.
[56] P. Lu, L. Qiu, J. Chen, T. Xia, Y. Zhao, W. Zhang, Z. Yu, X. Liang, and S.-C. Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021.
[57] madebyollin. Megalith-huggingface. https://huggingface.co/datasets/madebyollin/megalith-10m, 2024.
[58] mehdidc. Yfcc-huggingface. https://huggingface.co/datasets/mehdidc/yfcc15m, 2024.
[59] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[60] J. Pan, K. Sun, Y. Ge, H. Li, H. Duan, X. Wu, R. Zhang, A. Zhou, Z. Qin, Y. Wang, J. Dai, Y. Qiao, and H. Li. Journeydb: A benchmark for generative image understanding, 2023.
[61] Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei. Beit v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366, 2022.
[62] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
[63] ProGamerGov. Dalle3-high-quality-captions. https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions, 2024.
[64] A. Radford. Improving language understanding by generative pre-training. 2018.
[65] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. PMLR, 2021.
[66] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
[67] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[68] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
[69] S. Shah, A. Mishra, N. Yadati, and P. P. Talukdar. Kvqa: Knowledge-aware visual question answering. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 8876–8884, 2019.
[70] V. Singla, K. Yue, S. Paul, R. Shirkavand, M. Jayawardhana, A. Ganjdanesh, H. Huang, A. Bhatele, G. Somepalli, and T. Goldstein. From pixels to prose: A large dataset of dense image captions. arXiv preprint arXiv:2406.10328, 2024.
[71] J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[72] K. Srinivasan, K. Raman, J. Chen, M. Bendersky, and M. Najork. Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2443–2449, 2021.
[73] P. Sun, Y. Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.
[74] Q. Sun, Y. Fang, L. Wu, X. Wang, and Y. Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023.
[75] Q. Sun, Q. Yu, Y. Cui, F. Zhang, X. Zhang, Y. Wang, H. Gao, J. Liu, T. Huang, and X. Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023.
[76] Q. Sun, Y. Cui, X. Zhang, F. Zhang, Q. Yu, Y. Wang, Y. Rao, J. Liu, T. Huang, and X. Wang. Generative multimodal models are in-context learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14398–14409, 2024.
[77] C. Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
[78] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[79] K. Tian, Y. Jiang, Z. Yuan, B. Peng, and L. Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint arXiv:2404.02905, 2024.
[80] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[81] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[82] W. Wang, Z. Chen, X. Chen, J. Wu, X. Zhu, G. Zeng, P. Luo, T. Lu, J. Zhou, Y. Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. Advances in Neural Information Processing Systems, 36, 2024.
[83] X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang, F. Zhang, Y. Wang, Z. Li, Q. Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024.
[84] S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023.
[85] Y. Wu, Z. Zhang, J. Chen, H. Tang, D. Li, Y. Fang, L. Zhu, E. Xie, H. Yin, L. Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024.
[86] J. Xie, W. Mao, Z. Bai, D. J. Zhang, W. Wang, K. Q. Lin, Y. Gu, Z. Chen, Z. Yang, and M. Z. Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024.
[87] Z. Xue, G. Song, Q. Guo, B. Liu, Z. Zong, Y. Liu, and P. Luo. Raphael: Text-to-image generation via large mixture of diffusion paths. Advances in Neural Information Processing Systems, 36, 2024.
[88] F. Yang, C. Ma, J. Zhang, J. Zhu, W. Yuan, and A. Owens. Touch and go: Learning from human-collected vision and touch. arXiv preprint arXiv:2211.12498, 2022.
[89] L. Yu, Y. Cheng, K. Sohn, J. Lezama, H. Zhang, H. Chang, A. G. Hauptmann, M.-H. Yang, Y. Hao, I. Essa, et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10459–10469, 2023.
[90] W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
[91] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.
[92] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
[93] C. Zheng, T.-L. Vuong, J. Cai, and D. Phung. Movq: Modulating quantized vectors for high-fidelity image generation. Advances in Neural Information Processing Systems, 35:23412–23425, 2022.
[94] C. Zhou, L. Yu, A. Babu, K. Tirumala, M. Yasunaga, L. Shamis, J. Kahn, X. Ma, L. Zettlemoyer, and O. Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024.
[95] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
[96] Y. Zhu, M. Zhu, N. Liu, Z. Ou, X. Mou, and J. Tang. Llava-phi: Efficient multi-modal assistant with small language model. arXiv preprint arXiv:2401.02330, 2024.