DreamLLM: Synergistic Multimodal Comprehension and Creation
ICLR 2024 Spotlight
Abstract
This paper presents DreamLLM, a learning framework that is the first to achieve versatile Multimodal Large Language Models (MLLMs) empowered with the frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. The first focuses on the generative modeling of both language and image posteriors by direct sampling in the raw multimodal space. This approach circumvents the limitations and information loss inherent to external feature extractors like CLIP, yielding a more thorough multimodal understanding. Second, DreamLLM fosters the generation of raw, interleaved documents, modeling both text and image contents along with unstructured layouts. This allows DreamLLM to learn all conditional, marginal, and joint multimodal distributions effectively. As a result, DreamLLM is the first MLLM capable of generating free-form interleaved content. Comprehensive experiments highlight DreamLLM's superior performance as a zero-shot multimodal generalist, benefiting from the enhanced learning synergy.
Given any interleaved document, DreamLLM is trained by generative modeling of everything, including text and images. The images are generated by the Stable Diffusion (SD) image decoder, conditioned on dream queries that gather semantics from the MLLM. DreamLLM also learns where to generate images by predicting special dream tokens.
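To make the role of the dream token concrete, here is a minimal sketch of how a generated token stream could be split into text segments and image slots. It is an illustration only, not DreamLLM's implementation: the token string `<dream>`, the `ImageSlot` type, and the `split_interleaved` helper are all hypothetical, and the preceding text stands in for the conditioning that the real model obtains through learned dream queries and the SD decoder.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical marker; the real model uses a learned special token.
DREAM_TOKEN = "<dream>"


@dataclass
class ImageSlot:
    """Placeholder for an image to be synthesized at this position."""
    # Stand-in for the conditioning signal: all text generated so far.
    context: str


def split_interleaved(tokens: List[str]) -> List[Union[str, ImageSlot]]:
    """Split a token stream into text segments and image slots.

    Each DREAM_TOKEN marks a position where an image should appear,
    so the layout of the document is decided by the token stream itself.
    """
    segments: List[Union[str, ImageSlot]] = []
    text_buf: List[str] = []   # text since the last image slot
    history: List[str] = []    # all text seen so far

    for tok in tokens:
        if tok == DREAM_TOKEN:
            if text_buf:
                segments.append(" ".join(text_buf))
                text_buf = []
            segments.append(ImageSlot(context=" ".join(history)))
        else:
            text_buf.append(tok)
            history.append(tok)

    if text_buf:
        segments.append(" ".join(text_buf))
    return segments


if __name__ == "__main__":
    stream = ["A", "cat", DREAM_TOKEN, "and", "a", "whisky", DREAM_TOKEN]
    for segment in split_interleaved(stream):
        print(segment)
```

In this sketch, each `ImageSlot` would be handed to an image decoder. The point is only that image placement is predicted as part of the sequence rather than fixed by a template.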
Multimodal dialogue examples
Einstein: Good afternoon, Mr. Messi. How are you?
Messi: Hello, Dr. Einstein. I am doing well. Thank you for having me.
Einstein: I am pleased to encounter a fellow genius in the world of sports, and I would love to hear your insights on the game.
Messi: Thank you, sir. As a soccer player, I feel that the key to success is having a strong mindset, an understanding of the game, and the ability to adapt to different situations. I try to learn from each game and work on my weaknesses.
Einstein: Indeed, I have said similar things about the nature of intelligence and knowledge. Learning from past experiences is crucial to personal growth and understanding. It reminds me of a famous quote by the German philosopher Johann Gottlieb Fichte: “If you would be a genius, think deeply and long, and incessantly about one thing.”
Messi: I understand the value of continuous learning, and I try to apply it in my training and on the field. I am always trying to improve my skills and learn from my colleagues in the club.
Einstein: That is a wonderful attitude to have. I am also very impressed by your passion for the field. In my opinion, passion is an integral part of success. It fuels the drive to achieve greatness, and it is especially evident in my own field of physics. For instance, I once said, “The passion for truth is the foundation for the sciences.”
Messi: It is a great feeling to know that I am playing a sport that is full of passionate individuals. I am proud to be a part of that world and do my part in contributing to it.
Einstein: I am sure that you, Mr. Messi, would agree that having a passion for knowledge can lead to discoveries that change the world. For example, my theory of relativity has had a profound impact on our understanding of space, time, and gravity. It was a result of decades of contemplation and research.
Messi: The world of sports and the world of science are not so different when it comes to the pursuit of excellence. I am honored to be in the presence of a true genius like you, Dr. Einstein.
Einstein: I feel the same way, and I would like to thank you for this opportunity to engage in a conversation about our respective fields. The world is a wondrous place, and I am grateful to be a part of it.
Next, I would look for the mobile phone on the table, as shown in the image below.
Text-to-image generation
Oil-on-canvas painting of a blue night sky with roiling energy.
A fuzzy and bright yellow crescent moon shining at the top.
Below the exploding yellow stars and radiating swirls of blue, a distant village sits quietly on the right.
Connecting earth and sky is a flame-like cypress tree with curling and swaying branches on the left.
A church spire rises as a beacon over rolling blue hills.
How are images generated through dream queries?
A cat and a whisky.
A polar bear in the forest.
Citation
@inproceedings{dong2024dreamllm,
author = {Dong, Runpei and Han, Chunrui and Peng, Yuang and Qi, Zekun and Ge, Zheng and Yang, Jinrong and Zhao, Liang and Sun, Jianjian and Zhou, Hongyu and Wei, Haoran and Kong, Xiangwen and Zhang, Xiangyu and Ma, Kaisheng and Yi, Li},
title = {Dream{LLM}: Synergistic Multimodal Comprehension and Creation},
booktitle = {The Twelfth International Conference on Learning Representations},
url = {https://openreview.net/forum?id=y01KGvd9Bw},
year = {2024},
}