Abstract
We present JanusFlow, a powerful framework that unifies image understanding and generation
in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive
language models with rectified flow, a state-of-the-art method in generative modeling. Our
key finding demonstrates that rectified flow can be straightforwardly trained within the large
language model framework, eliminating the need for complex architectural modifications.
To further improve the performance of our unified model, we adopt two key strategies: (i)
decoupling the understanding and generation encoders, and (ii) aligning their representations
during unified training. Extensive experiments show that JanusFlow achieves comparable or
superior performance to specialized models in their respective domains, while significantly
outperforming existing unified approaches on standard benchmarks. This work represents a
step toward more efficient and versatile vision-language models.
Large language models (LLMs) have demonstrated remarkable capabilities in learning diverse
knowledge and generalizing to new scenarios [1, 7, 8, 69, 91]. Leveraging these capabilities,
researchers have developed sophisticated models specialized in image comprehension [2, 15, 47,
49, 56, 58] and text-to-image generation [23, 73, 76, 79]. The field has recently shifted toward creating unified systems capable of handling both
tasks simultaneously. One prominent direction involves utilizing pre-trained text-to-image
models for high-quality generation while training LLMs to generate conditions for ...
To optimize JanusFlow’s performance,
we implement two key strategies: First, we maintain separate vision encoders for understanding
and generation tasks, preventing task interference and thus enhancing comprehension capabilities.
Second, we align the intermediate representations between generation and understanding
modules during training, strengthening semantic coherence in the generation process. JanusFlow shows state-of-the-art performances in both multimodal comprehension and
text-to-image generation compared to existing unified approaches, and even outperforms
several specialized methods...
Visual Generation with Flow-based Generative Models.
Recent years have witnessed remarkable
progress in visual generation through diffusion models [32, 83], leading to impressive
models like [67, 73, 76–79]. Building on these advances, flow-based generative models [3, 55, 61]
emerged as a simplified alternative framework. These approaches have recently enabled advanced
visual generation models [23, 36] that achieve superior empirical performance with
faster sampling. Our work demonstrates that rectified flow [60–62] can be effectively integrated
into LLMs, creating unified models that exce...
Multimodal LLMs. Given a dataset $\mathcal{D}$ containing discrete token sequences, each of which
can be formulated as $x = (x_1, \cdots, x_\ell)$, large language models (LLMs) are trained to model the
sequence distribution in an autoregressive manner,
$$\log P_{\theta_{LLM}}(x) = \sum_{i=0}^{\ell-1} \log P_{\theta_{LLM}}(x_{i+1} \mid x_1, \ldots, x_i), \tag{1}$$
where $\theta_{LLM}$ denotes the parameters of the LLM and $\ell$ is the sequence length. After being
trained on large-scale datasets, LLMs exhibit the ability to generalize across various tasks and
follow diverse instructions [1, 8, 69]. To extend these models to handle visual inputs, LLMs
are augmented with vision enco...
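The autoregressive factorization in Eq. (1) can be illustrated with a minimal sketch. The bigram table and the `toy_cond_prob` function below are hypothetical stand-ins for a trained LLM, chosen only so the sum of conditional log-probabilities is easy to follow; they are not part of JanusFlow.

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Sum log P(x_{i+1} | x_1..x_i) over the sequence, as in Eq. (1)."""
    total = 0.0
    for i in range(len(tokens)):
        prefix = tuple(tokens[:i])          # x_1, ..., x_i (empty for i = 0)
        total += math.log(cond_prob(prefix, tokens[i]))
    return total

# Hypothetical toy "model": a fixed bigram table over a two-symbol vocabulary.
BIGRAM = {
    None: {"a": 0.5, "b": 0.5},   # distribution of the first token
    "a": {"a": 0.9, "b": 0.1},
    "b": {"a": 0.2, "b": 0.8},
}

def toy_cond_prob(prefix, token):
    """Conditional probability that only looks at the last prefix token."""
    last = prefix[-1] if prefix else None
    return BIGRAM[last][token]

log_p = sequence_log_prob(["a", "a", "b"], toy_cond_prob)
print(math.exp(log_p))  # product of the three conditionals: 0.5 * 0.9 * 0.1
```

A real LLM conditions on the full prefix rather than the last token; the bigram shortcut only keeps the example small.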
The network is trained by minimizing the
Euclidean distance between the neural velocity and the directions of linear paths connecting
random points from $\pi_0$ and $\pi_1$,
$$\min_{\theta} \; \mathbb{E}_{t \sim P(t),\, z_0 \sim \pi_0,\, x \sim \pi_1} \left[ \left\| v_{\theta_{NN}}(z_t, t) - (x - z_0) \right\|^2 \right], \quad \text{where } z_t = t x + (1 - t) z_0. \tag{3}$$
Here, $P(t)$ is a distribution over time $t \in [0, 1]$. When the network has sufficient capacity and the
objective is perfectly minimized, the optimal velocity field $v_{\theta^*_{NN}}$ maps the elementary distribution
$\pi_0$ to the true data distribution $\pi_1$. More precisely, the distribution of $z_1 = \int_0^1 v_{\theta^*_{NN}}(z_t, t)\,\mathrm{d}t$,
with $z_0 \sim \pi_0$, follows $\pi_1$. Despite its conceptual simp...
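As a rough illustration of the objective in Eq. (3), the sketch below evaluates the rectified flow loss for a one-dimensional toy problem and then draws a sample by Euler integration of $\mathrm{d}z_t/\mathrm{d}t = v(z_t, t)$ starting from $z_0$. Everything here is an assumption made for illustration: the data distribution is a single point `C`, the "network" is the closed-form optimal velocity for that case, and the function names are hypothetical.

```python
import random

def interpolate(t, z0, x):
    """Linear path z_t = t * x + (1 - t) * z0 from Eq. (3)."""
    return t * x + (1.0 - t) * z0

def rectified_flow_loss(velocity, batch):
    """Mean squared error between v(z_t, t) and the path direction x - z0."""
    total = 0.0
    for t, z0, x in batch:
        z_t = interpolate(t, z0, x)
        total += (velocity(z_t, t) - (x - z0)) ** 2
    return total / len(batch)

def euler_sample(velocity, z0, num_steps=50):
    """Integrate dz/dt = v(z, t) from t = 0 to t = 1 with forward Euler steps."""
    z, dt = z0, 1.0 / num_steps
    for k in range(num_steps):
        z += dt * velocity(z, k * dt)
    return z

# Toy setting (assumption): pi_1 is a point mass at C, pi_0 is a standard normal.
C = 2.0

def optimal_velocity(z_t, t):
    """Closed-form minimizer of Eq. (3) when every data point equals C."""
    return (C - z_t) / (1.0 - t)

rng = random.Random(0)
batch = [(rng.random(), rng.gauss(0.0, 1.0), C) for _ in range(256)]
loss = rectified_flow_loss(optimal_velocity, batch)
sample = euler_sample(optimal_velocity, z0=rng.gauss(0.0, 1.0))
print(loss, sample)
```

In this degenerate case the loss is zero up to rounding and every trajectory ends at `C`; with a learned velocity network the same two functions would be driven by model outputs instead.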
3.2. A Unified Framework for Multimodal Understanding and Generation
JanusFlow presents a unified framework designed to address both vision understanding and
image generation tasks. Next we outline how JanusFlow handles these two tasks within a single
LLM architecture.
Multimodal Understanding. In multimodal understanding tasks, the LLM processes an input
sequence consisting of interleaved text and image data. The text is tokenized into discrete
tokens, each of which is transformed into an embedding of dimension $D_{emb}$. For the images, an
image encoder $f_{enc}$ encodes each image $x_{im}$ into a feature map of shape $H_{im} \times W_{im} \times D_{enc}$. This
feature map is flattened and projecte...
To enhance
generation quality, we employ classifier-free guidance (CFG) when computing the velocity:
$$v(z_t, t) = w\, v(z_t, t \mid x^{con}) + (1 - w)\, v(z_t, t \mid \varnothing), \tag{5}$$
where $v(z_t, t \mid \varnothing)$ denotes the velocity inferred without text conditioning and $w \geqslant 1$ controls the
magnitude of CFG. Empirically, increasing $w$ yields higher semantic alignment [23, 62, 73, 77]. Analogous to multimodal understanding, we prepend the special token |BOI| to indicate the
start of image generation in the sequence. Decoupling Encoders for the Two Tasks. Previous approaches that unify autoregressive
generation and diffusion models within a ...
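The classifier-free guidance rule of Eq. (5) amounts to a weighted combination of two velocity predictions. The sketch below shows that arithmetic on plain Python lists; the function name `cfg_velocity` and the example vectors are hypothetical, and in practice the two inputs would be model outputs for the conditioned and unconditioned passes.

```python
def cfg_velocity(v_cond, v_uncond, w):
    """Combine conditional and unconditional velocities as in Eq. (5).

    v_cond   : velocity predicted with the text condition
    v_uncond : velocity predicted without text conditioning
    w        : guidance weight, assumed to satisfy w >= 1
    """
    if w < 1:
        raise ValueError("Eq. (5) assumes a guidance weight w >= 1")
    return [w * c + (1.0 - w) * u for c, u in zip(v_cond, v_uncond)]

# Hypothetical two-dimensional velocities.
v = cfg_velocity([1.0, 0.0], [0.5, 0.5], w=2.0)
print(v)
```

With `w = 1` the result equals the conditional velocity; larger `w` extrapolates away from the unconditional prediction, since the same expression can be written as `u + w * (c - u)`.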
As illustrated in Fig. 3, we train our model in three sequential stages, detailed below.
Stage 1: Adaptation of Randomly Initialized Components.
In the first stage, we focus on
training only the randomly initialized components: the linear layers, generation encoder, and
Figure ... | Panels: Stage 1 (Adaptation); Stage 2 (Unified Pre-Training); Stage 3 (Supervised Fine-Tuning). Each panel shows the same components: LLM, Und. Enc. $f_{enc}$, Linear, Gen. Enc. $g_{enc}$, VAE Enc., Gen. Dec. $g_{dec}$, and Text De-Token.
Training JanusFlow involves two types of data, multimodal understanding data and image
generation data. Both types of data contain two parts: “condition” and “response”. “Condition”
refers to the prompting of the tasks (e.g., text prompts in the task of generation and images in
the task of understanding) while “response” refers to the corresponding responses of the two
tasks. The data can be formatted as $x = (x^{con}, x^{res})$, where the superscript $con$ denotes “condition”
and $res$ denotes “response”. We denote the length of the whole sequence $x$ as $\ell$, the length of
$x^{con}$ as $\ell_{con}$ and the length of $x^{res}$ as ℓ𝑟𝑒...
Our decoupled vision encoder design enables efficient implementation
of this alignment as a regularization term. Specifically, for generation tasks, we align
features from the understanding encoder $f_{enc}$ with the LLM’s intermediate features,
$$\mathcal{L}_{REPA}(\theta, \varphi) = -\mathbb{E}_{x \sim \mathcal{D}_{gen}} \left[ \mathrm{sim}\left( \mathrm{stop\_grad}(f_{enc}(x^{res})),\; h_{\varphi}(q_{\theta}(z_t)) \right) \right], \tag{8}$$
where $q_{\theta}(z_t)$ denotes an intermediate LLM representation given input $z_t$, and $h_{\varphi}$ is a small
trainable MLP that projects $q_{\theta}(z_t)$ to dimension $D_{enc}$. The function $\mathrm{sim}(\cdot, \cdot)$ computes the mean
of element-wise cosine similarity between embeddings. Before computing the loss, we reshape
ℎ𝜑(𝑞𝜃(𝑧𝑡))...
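One way to read the alignment term in Eq. (8) is as the negative mean cosine similarity between matching feature vectors. The sketch below follows that reading under stated assumptions: the features are already reshaped so that the two lists have one vector per spatial position, `target_feats` stands in for $\mathrm{stop\_grad}(f_{enc}(x^{res}))$, and `projected_feats` stands in for $h_{\varphi}(q_{\theta}(z_t))$. Plain Python has no autograd, so the stop-gradient is only noted in a comment; all names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def repa_loss(target_feats, projected_feats):
    """Negative mean of per-position cosine similarities.

    target_feats    : understanding-encoder features, treated as constants
                      (stop_grad in Eq. (8); a no-op without autograd)
    projected_feats : LLM intermediate features after the projection MLP
    """
    sims = [cosine(f, h) for f, h in zip(target_feats, projected_feats)]
    return -sum(sims) / len(sims)

# Hypothetical features for two spatial positions with D_enc = 2.
target = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 3.0]]       # same directions, different norms
half_flipped = [[2.0, 0.0], [0.0, -3.0]]  # second position points the opposite way

print(repa_loss(target, aligned), repa_loss(target, half_flipped))
```

Because cosine similarity ignores vector norms, the loss reaches its minimum of -1 whenever the projected features point in the same directions as the encoder features.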
We conduct extensive experiments to evaluate the capabilities of JanusFlow in both multimodal
understanding and generation tasks. First, we describe our experimental setup and implementation
details. Then, we present results on standard benchmarks for multimodal understanding
and image generation. Finally, we perform ablation studies to validate our key design choices.
Our framework builds upon an enhanced version¹ of DeepSeek-LLM (1.3B) [7, 64]. The LLM
consists of 24 transformer blocks and supports a sequence length of 4,096. In our model, both
understanding and generation exploit images of resolution 384.
For multimodal understanding, we leverage SigLIP-Large-Patch/16 [106] as $f_{enc}$. For image
generation, we utilize the pre-trained SDXL-VAE [73] for its latent space. The generation encoder
$g_{enc}$ comprises a 2 × 2 patchify layer followed by two ConvNeXt [96] blocks and a linear layer.
The generation decoder 𝑔𝑑𝑒𝑐combines two ConvNeXt blocks, a pixel-shuffle ...
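The configuration above implies a token budget per image that can be checked with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a statement from the paper: it assumes the SDXL-VAE downsamples each spatial side by a factor of 8, reads the patch size of 16 from the SigLIP model name, and ignores any special tokens such as |BOI|. The helper `tokens_per_image` is hypothetical.

```python
def tokens_per_image(resolution, downsample, patch):
    """Number of tokens for a square image after downsampling and patchifying."""
    stride = downsample * patch
    if resolution % stride != 0:
        raise ValueError("resolution must be divisible by downsample * patch")
    side = resolution // stride
    return side * side

# Understanding path: 384 x 384 pixels, patch size 16 (from the SigLIP model name).
und_tokens = tokens_per_image(384, downsample=1, patch=16)

# Generation path: assumed 8x VAE downsampling, then the 2 x 2 patchify layer.
gen_tokens = tokens_per_image(384, downsample=8, patch=2)

print(und_tokens, gen_tokens)
```

Under these assumptions both paths yield a 24 × 24 grid, which fits comfortably inside the 4,096-token sequence length quoted above.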
We follow Janus [97] to construct the training data. The data configuration for each training
stage is listed below.
Data for Stage 1 and Stage 2.
The first two stages of our framework use three types of data:
multimodal understanding data, image generation data and text-only data.
1. Multimodal Understanding Data. This type of data contains several sub-categories: (a)
Image caption data. We incorporate caption datasets from [20, 41, 50, 51, 53, 82] and
generate additional captions for images from [16, 43] using open-source multimodal
understanding models. The names of the datasets are provided in the supplementary
materials. The data follows template formats, e.g., “Generate the caption
of this picture.” (b) Charts and tables. We directly adopt the chart
and table data from the training data of DeepSeek-VL [64]. (c) Task data. ShareGPT4V
[11] data is utilized to facilitate basic question-answering capabilities during pre-training,
1This ve...
Image Generation.
We evaluate the generated images using both visual quality and semantic
accuracy metrics. For visual quality assessment, we employ the Fréchet Inception Distance
[30] (FID) metric and compute FID between 30,000 generated images and their corresponding
reference images from the MJHQ dataset [48]. The FID computation follows the implementation
from GigaGAN [39]. To evaluate semantic accuracy, we utilize two specialized frameworks:
GenEval [28] and DPG-Bench [34]. These frameworks are designed to assess whether the
Table 3 | Performances on DPG-Bench. The methods in this tabl...
Table 4 | Results of MJHQ FID-30k. The models which have similar scales to our model are marked with blue background. JanusFlow achieves the best FID among 1.3B models.
Method              Params   FID↓
LWM [59]            7B       17.77
VILA-U 256 [99]     7B       12.81
VILA-U 384 [99]     7B       7.69
Show-o [100]        1.3B     15.18
Janus [97]          1.3B     10.10
JanusFlow (Ours)    1.3B     9.51
Image Generation Performances. We report the performances
on GenEval, DPG-Bench and MJHQ FID-30k. In
Tab. 2, we give comparisons on GenEval including the scores
of all the sub-tasks and the overall score. JanusFlow achieves
an overall score of 0.63, surpassing the prev...
We conduct comprehensive ablation studies to validate the effectiveness of our key design
choices. For computational efficiency, all ablation experiments are performed on 256 × 256
resolution images². All models are trained on our unified pre-training dataset for 50,000
iterations, except for the understanding-only and generation-only variants, which are trained
for proportionally fewer iterations based on their respective data ratios in the pre-training phase. The quantitative results of these ablation studies are presented in Tab. 6. Impact of Representation Alignment. The comparison betwee...
4.5. Ablation Studies
The effectiveness of representation alignment in our modified architecture
suggests its broad applicability and generalization capability across different network structures.
Impact of Decoupling Visual Encoders. [...] the efficacy of using powerful pre-trained visual encoders
in multimodal understanding. The comparison among Exp. B, C, and F demonstrates the
advantages of using separate visual encoders for understanding and generation tasks. In
Exp. B, following a design similar to Transfusion [108], we implement shared ConvNeXt
blocks in the SDXL-VAE latent space for both understanding and generation...
We present qualitative evaluations of our method for both image generation and understanding
tasks. Fig. 1(b) and Fig. 4 showcase the image generation capabilities of JanusFlow. These results
demonstrate both the high visual quality of our generated images and our framework’s ability
User: What are the kinds of fruits in this picture?
JanusFlow (Ours): The fruits in the picture are banana,
strawberry, mango, persimmon, blueberry, and lime.
User: Describe this image.
JanusFlow (Ours): Kid's
Favourite Fruits is a bar graph.
The x-axis shows the fruits. The
y-axis shows the number of kids
wh...
We present JanusFlow, a unified framework that successfully harmonizes autoregressive and
rectified flow models for multimodal understanding and generation tasks. Our extensive
experiments demonstrate that this unification achieves comparable performance to task-specific
models. The successful integration of these fundamentally different model architectures not
only addresses current challenges in multimodal learning but also opens new possibilities for
future research in training unified models.
References
[1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Proc. Annu. Conf. Neural Inf. Process. Systems, 2022.
[3] M. Albergo and E. Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In Proc. Int'l Conf. Learning Representations, 2023.
[4] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
[5] F. Bao, S. Nie, K. Xue, Y. Cao, C. Li, H. Su, and J. Zhu. All are worth words: A ViT backbone for diffusion models. In Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2023.
[6] J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al. Improving image generation with better captions. Computer Science, 2023.
[7] X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, et al. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
[8] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
[9] J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, P. Luo, H. Lu, et al. PixArt-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.
[10] J. Chen, C. Ge, E. Xie, Y. Wu, L. Yao, X. Ren, Z. Wang, P. Luo, H. Lu, and Z. Li. PixArt-Sigma: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. arXiv preprint arXiv:2403.04692, 2024.
[11] L. Chen, J. Li, X. Dong, P. Zhang, C. He, J. Wang, F. Zhao, and D. Lin. ShareGPT4V: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.
[12] X. Chu, L. Qiao, X. Lin, S. Xu, Y. Yang, Y. Hu, F. Wei, X. Zhang, B. Zhang, X. Wei, et al. MobileVLM: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.
[13] X. Chu, L. Qiao, X. Zhang, S. Xu, F. Wei, Y. Yang, X. Sun, Y. Hu, X. Lin, B. Zhang, et al. MobileVLM V2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.
[14] K. Crowson, S. A. Baumann, A. Birch, T. M. Abraham, D. Z. Kaplan, and E. Shippole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. In Proc. Int'l Conf. Machine Learning, 2024.
[15] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Proc. Annu. Conf. Neural Inf. Process. Systems, 2023.
[18] P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis. In Proc. Annu. Conf. Neural Inf. Process. Systems, 2021.
[19] R. Dong, C. Han, Y. Peng, Z. Qi, Z. Ge, J. Yang, L. Zhao, J. Sun, H. Zhou, H. Wei, et al. DreamLLM: Synergistic multimodal comprehension and creation. In Proc. Int'l Conf. Learning Representations, 2024.
[21] B. Egan, A. Redden, XWAVE, and SilentAntagonist. DALLE-3 1 million+ high quality captions, 2024. URL https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions.
[22] P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2021.
[23] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Proc. Int'l Conf. Machine Learning, 2024.
[24] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun, Y. Wu, and R. Ji. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2024.
[25] Y. Ge, Y. Ge, Z. Zeng, X. Wang, and Y. Shan. Planting a SEED of vision in large language model. arXiv preprint arXiv:2307.08041, 2023.
[26] Y. Ge, S. Zhao, Z. Zeng, Y. Ge, C. Li, X. Wang, and Y. Shan. Making LLaMA SEE and draw with SEED tokenizer. arXiv preprint arXiv:2310.01218, 2023.
[27] Y. Ge, S. Zhao, J. Zhu, Y. Ge, K. Yi, L. Song, C. Li, X. Ding, and Y. Shan. SEED-X: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024.
[28] D. Ghosh, H. Hajishirzi, and L. Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment. In Proc. Annu. Conf. Neural Inf. Process. Systems, 2024.
[29] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the v in VQA matter: Elevating the role of image understanding in visual question answering. In Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2017.
[30] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proc. Annu. Conf. Neural Inf. Process. Systems, 2017.
[32] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In Proc. Annu. Conf. Neural Inf. Process. Systems, 2020.
[33] Y.-C. Hsiao, F. Zubach, G. Baechler, V. Carbune, J. Lin, M. Wang, S. Sunkara, Y. Zhu, and J. Chen. ScreenQA: Large-scale question-answer pairs over mobile app screenshots. arXiv preprint arXiv:2209.08199, 2022.
[34] X. Hu, R. Wang, Y. Fang, B. Fu, P. Cheng, and G. Yu. ELLA: Equip diffusion models with LLM for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024.
[35] D. A. Hudson and C. D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2019.
[36] Y. Jin, Z. Sun, N. Li, K. Xu, H. Jiang, N. Zhuang, Q. Huang, Y. Song, Y. Mu, and Z. Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024.
[37] Y. Jin, K. Xu, L. Chen, C. Liao, J. Tan, Q. Huang, C. Bin, C. Song, D. Zhang, W. Ou, et al. Unified language-vision pretraining in LLM with dynamic discrete visual tokenization. In Proc. Int'l Conf. Learning Representations, 2024.
[38] B. Jing, B. Berger, and T. Jaakkola. AlphaFold meets flow matching for generating protein ensembles. In Proc. Int'l Conf. Machine Learning, 2024.
[39] M. Kang, J.-Y. Zhu, R. Zhang, J. Park, E. Shechtman, S. Paris, and T. Park. Scaling up GANs for text-to-image synthesis. In Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2023.
[40] S. Kim, K. Shih, J. F. Santos, E. Bakhturina, M. Desta, R. Valle, S. Yoon, B. Catanzaro, et al. P-Flow: a fast and data-efficient zero-shot TTS through speech prompting. In Proc. Annu. Conf. Neural Inf. Process. Systems, 2024.
[41] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. In Proc. IEEE Int. Conf. Comput. Vision, 2023.
[42] M. Koupaee and W. Y. Wang. WikiHow: A large scale text summarization dataset. arXiv preprint arXiv:1810.09305, 2018.
[43] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, et al. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 2020.
[44] H. Laurençon, D. van Strien, S. Bekman, L. Tronchon, L. Saulnier, T. Wang, S. Karamcheti, A. Singh, G. Pistilli, Y. Jernite, et al. Introducing IDEFICS: An open reproduction of state-of-the-art visual language model, 2023. URL https://huggingface.co/blog/idefics.
[45] M. Le, A. Vyas, B. Shi, B. Karrer, L. Sari, R. Moritz, M. Williamson, V. Manohar, Y. Adi, J. Mahadeokar, et al. VoiceBox: Text-guided multilingual universal speech generation at scale. In Proc. Annu. Conf. Neural Inf. Process. Systems, 2024.
[46] B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
[47] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, Y. Li, Z. Liu, and C. Li. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
[48] D. Li, A. Kamko, E. Akhgari, A. Sabet, L. Xu, and S. Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024.
[49] J. Li, D. Li, S. Savarese, and S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proc. Int'l Conf. Machine Learning, 2023.
[50] L. Li, Y. Wang, R. Xu, P. Wang, X. Feng, L. Kong, and Q. Liu. Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models. In Proc. Annu. Meeting Assoc. Comput. Linguistics, 2024.
[51] X. Li, F. Zhang, H. Diao, Y. Wang, X. Wang, and L.-Y. Duan. DenseFusion-1M: Merging vision experts for comprehensive multimodal perception. In Proc. Annu. Conf. Neural Inf. Process. Systems, 2024.
[52] Y. Li, Y. Du, K. Zhou, J. Wang, X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. In Proc. Conf. on Empirical Methods in Natural Language Process., 2023.
[53] Z. Li, X. Yang, K. Choi, W. Zhu, R. Hsieh, H. Kim, J. H. Lim, S. Ji, B. Lee, X. Yan, et al. MMSci: A multimodal multi-discipline dataset for PhD-level scientific comprehension. In AI for Accelerated Materials Design, 2024.
[54] Z. Li, J. Zhang, Q. Lin, J. Xiong, Y. Long, X. Deng, Y. Zhang, X. Liu, M. Huang, Z. Xiao, et al. Hunyuan-DiT: A powerful multi-resolution diffusion transformer with fine-grained Chinese understanding. arXiv preprint arXiv:2405.08748, 2024.
[55] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. In Proc. Int'l Conf. Learning Representations, 2023.
[56] H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning. In Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2024.
[57] H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
[58] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In Proc. Annu. Conf. Neural Inf. Process. Systems, 2024.
[59] H. Liu, W. Yan, M. Zaharia, and P. Abbeel. World model on million-length video and language with RingAttention. arXiv preprint arXiv:2402.08268, 2024.
[61] X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In Proc. Int'l Conf. Learning Representations, 2023.
[62] X. Liu, X. Zhang, J. Ma, J. Peng, et al. InstaFlow: One step is enough for high-quality diffusion-based text-to-image generation. In Proc. Int'l Conf. Learning Representations, 2024.
[63] Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, et al. MMBench: Is your multi-modal model an all-around player? In Proc. Eur. Conf. Computer Vision, 2024.
[64] H. Lu, W. Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T. Ren, Z. Li, H. Yang, et al. DeepSeek-VL: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.
[65] P. Lu, L. Qiu, J. Chen, T. Xia, Y. Zhao, W. Zhang, Z. Yu, X. Liang, and S.-C. Zhu. IconQA: A new benchmark for abstract diagram understanding and visual language reasoning. In Proc. Annu. Conf. Neural Inf. Process. Systems, 2021.
[66] N. Ma, M. Goldstein, M. S. Albergo, N. M. Boffi, E. Vanden-Eijnden, and S. Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. arXiv preprint arXiv:2401.08740, 2024.
[67] Y. Ma, H. Yang, W. Wang, J. Fu, and J. Liu. Unified multi-modal latent diffusion for joint subject and text conditional image generation. arXiv preprint arXiv:2303.09319, 2023.
[68] madebyollin. Megalith-10M, 2024. URL https://huggingface.co/datasets/madebyollin/megalith-10m.
[69] B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[70] A. Masry, X. L. Do, J. Q. Tan, S. Joty, and E. Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Proc. Annu. Meeting Assoc. Comput. Linguistics, 2022.
[72] W. Peebles and S. Xie. Scalable diffusion models with transformers. In Proc. IEEE Int. Conf. Comput. Vision, 2023.
[73] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In Proc. Int'l Conf. Learning Representations, 2024.
[75] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In Proc. Int'l Conf. Machine Learning, 2021.
[76] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[77] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2022.
[78] L. Ruan, Y. Ma, H. Yang, H. He, B. Liu, J. Fu, N. J. Yuan, Q. Jin, and B. Guo. MM-Diffusion: Learning multi-modal diffusion models for joint audio and video generation. In Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2022.
[79] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In Proc. Annu. Conf. Neural Inf. Process. Systems, 2022.
[80] S. Shah, A. Mishra, N. Yadati, and P. P. Talukdar. KVQA: Knowledge-aware visual question answering. In Proc. AAAI Conf. on Artificial Intelligence, 2019.
[81] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach. Towards VQA models that can read. In Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2019.
[82] V. Singla, K. Yue, S. Paul, R. Shirkavand, M. Jayawardhana, A. Ganjdanesh, H. Huang, A. Bhatele, G. Somepalli, and T. Goldstein. From pixels to prose: A large dataset of dense image captions. arXiv preprint arXiv:2406.10328, 2024.
[83] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In Proc. Int'l Conf. Learning Representations, 2021.
[84] K. Srinivasan, K. Raman, J. Chen, M. Bendersky, and M. Najork. WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Proc. ACM SIGIR Conf. Research and Development in Information Retrieval, 2021.
[85] K. Sun, J. Pan, Y. Ge, H. Li, H. Duan, X. Wu, R. Zhang, A. Zhou, Z. Qin, Y. Wang, et al. JourneyDB: A benchmark for generative image understanding. In Proc. Annu. Conf. Neural Inf. Process. Systems, 2024.
[86] P. Sun, Y. Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.
[87] Q. Sun, Q. Yu, Y. Cui, F. Zhang, X. Zhang, Y. Wang, H. Gao, J. Liu, T. Huang, and X. Wang. Generative pretraining in multimodality. In Proc. Int'l Conf. Learning Representations, 2024.
[88] C. Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
[89] G. Team. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[90] S. Tong, E. Brown, P. Wu, S. Woo, M. Middepogu, S. C. Akula, J. Yang, S. Yang, A. Iyer, X. Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. arXiv preprint arXiv:2406.16860, 2024.
[91] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[92] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. LLaMA 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[93] C. N. Vasconcelos, A. Rashwan, A. Waters, T. Walker, K. Xu, J. Yan, R. Qian, Y. Li, S. LUO, Y. Onoe, et al. Greedy growing enables high-resolution pixel-based diffusion models. Transactions on Machine Learning Research, 2024.
[94] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
[95] X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang, F. Zhang, Y. Wang, Z. Li, Q. Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024.
[96] S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie. ConvNeXt v2:与掩码自编码器协同设计与扩展卷积神经网络. 载于 Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2023.
[97] C. Wu, X. Chen, Z. Wu, Y. Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruan, et al. Janus:解耦视觉编码以实现统一的多模态理解与生成. arXiv 预印本 arXiv:2410.13848, 2024.
[98] S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua. NExT-GPT:任意到任意的多模态大语言模型. 载于 Proc. Int'l Conf. Machine Learning, 2024.
[99] Y. Wu, Z. Zhang, J. Chen, H. Tang, D. Li, Y. Fang, L. Zhu, E. Xie, H. Yin, L. Yi, et al. VILA-U:融合视觉理解与生成的统一基座模型. arXiv 预印本 arXiv:2409.04429, 2024.
[100] J. Xie, W. Mao, Z. Bai, D. J. Zhang, W. Wang, K. Q. Lin, Y. Gu, Z. Chen, Z. Yang, and M. Z. Shou. Show-o:单一 Transformer 统一多模态理解与生成. arXiv 预印本 arXiv:2408.12528, 2024.
[101] H. Ye, D.-A. Huang, Y. Lu, Z. Yu, W. Ping, A. Tao, J. Kautz, S. Han, D. Xu, P. Molchanov, et al. X-VILA:面向大语言模型的跨模态对齐. arXiv 预印本 arXiv:2405.19335, 2024.
[102] L. Yu, J. Lezama, N. B. Gundavarapu, L. Versari, K. Sohn, D. Minnen, Y. Cheng, A. Gupta, X. Gu, A. G. Hauptmann, et al. 语言模型胜过扩散模型:分词器是视觉生成的关键. 载于 Proc. Int'l Conf. Learning Representations, 2024.
[103] S. Yu, S. Kwak, H. Jang, J. Jeong, J. Huang, J. Shin, and S. Xie. 面向生成的表征对齐:训练扩散 Transformer 比你想象的更容易. arXiv 预印本 arXiv:2410.06940, 2024.
[104] W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. In Proc. Int'l Conf. Machine Learning, 2024.
[105] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2024.
[106] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In Proc. IEEE Int. Conf. Comput. Vision, 2023.
[107] C. Zhao, Y. Song, W. Wang, H. Feng, E. Ding, Y. Sun, X. Xiao, and J. Wang. MonoFormer: One transformer for both diffusion and autoregression. arXiv preprint arXiv:2409.16280, 2024.
[108] C. Zhou, L. Yu, A. Babu, K. Tirumala, M. Yasunaga, L. Shamis, J. Kahn, X. Ma, L. Zettlemoyer, and O. Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024.
[109] Y. Zhu, M. Zhu, N. Liu, Z. Ou, X. Mou, and J. Tang. LLaVA-Phi: Efficient multi-modal assistant with small language model. arXiv preprint arXiv:2401.02330, 2024.
[110] L. Zhuo, R. Du, H. Xiao, Y. Li, D. Liu, R. Huang, W. Liu, L. Zhao, F.-Y. Wang, Z. Ma, et al. Lumina-Next: Making Lumina-T2X stronger and faster with Next-DiT. arXiv preprint arXiv:2406.18583, 2024.
A. Performance Analysis of 256 Resolution Model
We trained our model at two resolutions: 256 × 256 and 384 × 384. The main paper presents results from the 384 × 384 model as our primary results. Here, we provide a comprehensive evaluation of the 256 × 256 model's performance. The visual understanding performance is presented in Tab. 1. The generation capabilities are evaluated using GenEval [28], DPG-Benchmark [34], and MJHQ FID-30k [48], with results shown in Tab. 2 and 3.
Table 1 | Results on visual understanding tasks.
Model | LLM Params | POPE↑ | MME-P↑ | MMB_dev↑ | SEED↑ | VQAv2_test↑ | GQA↑ | MM-Vet↑
JanusFl...
Appendix
For image generation, our model exhibits strong performance in both visual quality and semantic alignment with input prompts.

User: Please write a python code to draw this plot.
JanusFlow: Here is the Python code to draw this plot:
```python
import matplotlib.pyplot as plt
# Data
values = [3, 7, 5, 2]
# Create the bar chart
plt.figure(figsize=(10, 6))
plt.bar(range(len(values)), values)
# Title
plt.title('Simple Bar Chart')
# Labels
plt.xlabel('Categories')
plt.ylabel('Values')
# Show the plot
plt.show()
```

User: Explain why this image is funny.
JanusFlow: This image is likely to be a humorous or artistic rendition of the famous painting the Mona Lisa, with a dog in place of the human subject. The dog's expression and posture resemble the iconic pose of the original, adding a playful and unexpected touch to the artwork.

User: Please extract the text in this paragraph.
JanusFlow: To increase the accessibility of this powerful model class and at the same time reduce its significant resource consumption, a method is needed that reduces the computational complexity for both training and sampling. Reducing the computational demands of diffusion models (DMs) without impairing their performance is, therefore, key to enhancing their accessibility.

User: Who is in this picture?
JanusFlow: George W. Bush.

A grand church building, a battle between heaven and hell, a burning church, 8K ultra HD, extremely sharp, Renaissance painting in the style of Francisco Goya.
A handsome 24-year-old young man in the center of the frame, sky-blue background, wearing glasses, extremely detailed, anime style.
A happy, dreamy owl monster perched on a tree branch, surrounded by colorful sparkling particles, forest background, finely detailed feathers.
A man wearing a fedora, mafia style, realistic photography, intricate details, magical lighting, vivid background, complex textures, rich colors, realistic style, front view.
A vivid depiction of the aurora borealis dancing above snow-covered mountains in Iceland, casting a captivating glow across the sky.
A dark, high-contrast render of a psychedelic Tree of Life glowing brightly, illuminating swirling dust particles in a mystical cavern environment.
In the background, there's an out-of-focus scene that includes greenery and possibly some structures or trees shrouded by mist or fog, giving it a serene yet slightly eerie atmosphere. This photograph employs shallow depth of field to emphasize the mushrooms while blurring the surroundings for artistic effect.
The image captures a vast ocean view at either sunrise or sunset, with soft pink hues near the horizon blending into darker clouds above. Waves crash against rugged black rocks on the right, where water flows down onto smaller stones below. In the foreground, dry grass contrasts with the
...