Hello VITA
We are announcing VITA, the first-ever open-source Multimodal LLM that can process Video, Image, Text, and Audio, while also offering an advanced multimodal interactive experience.
All videos on this page are at 1x real time.
More Examples
Multimodal Foundation Capabilities
High-Resolution Perception:
How do I get to the Tianlin underpass? Is there anything I should watch out for?
VITA:
You need to turn right, and watch out for the tidal-flow (reversible) lane.
Dense OCR:
What are the precautions?
VITA:
Keep away from heat, sparks, open flames, and hot surfaces. No smoking. Do not spray onto an open flame or other ignition source. Pressurized container: do not pierce or burn, even after use. Use in a well-ventilated area. Switch off the power supply before handling electrical appliances.
Math:
Please help me solve this problem.
VITA:
This image shows a handwritten math problem. The specific content is as follows:
1. Problem:
\[ y = \sin x + e^{ix} \]
2. Condition:
\[ x = \pi \]
3. Question:
\[ y = ? \]
### Solution:
1. First, we need to evaluate \( \sin x \) and \( e^{ix} \) at \( x = \pi \).
2. For \( \sin x \): when \( x = \pi \), \( \sin \pi = 0 \).
3. For \( e^{ix} \): we use Euler's formula \( e^{ix} = \cos x + i \sin x \).
   - When \( x = \pi \), \( \cos \pi = -1 \) and \( \sin \pi = 0 \).
   - Therefore, \( e^{i\pi} = -1 + i \cdot 0 = -1 \).
4. So, when \( x = \pi \), \( y = \sin x + e^{ix} = 0 + (-1) = -1 \).
### Answer:
\[ y = -1 \]
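The value above can also be checked numerically with Python's math and cmath modules; the following snippet is an illustrative sanity check of the arithmetic, not part of VITA's answer.

import math, cmath

x = math.pi
y = math.sin(x) + cmath.exp(1j * x)   # sin(pi) + e^{i*pi}
print(y)                              # approximately (-1+0j), up to floating-point rounding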
Coding:
Can you translate this into Python code?
VITA:
Reasoning:
Can I ride a bicycle on this road? Why?
VITA:
Yes, you can ride a bicycle on this road. In the image, there is a “No Entry” sign, but below it there is an exception allowing bicycles and emergency vehicles to enter. This indicates that the street is open to bicycles and may be a designated bike lane or a bicycle-friendly area.
Reasoning with Calculation:
How many calories are there in 5 slices of this?
VITA:
There are 60 calories per serving, and each serving is 3 slices. So, one slice is 60 calories / 3 slices = 20 calories per slice. Therefore, 5 slices would be 20 calories/slice x 5 slices = 100 calories.
Video:
Please describe this video.
VITA:
The video begins with a nighttime scene of a street lined with buildings. Red and white fireworks explode in the sky above the buildings. The camera pans slightly to the right, following the trajectory of the fireworks. As the fireworks continue, the camera moves again, now focusing on a different section of the street with a different building. The fireworks continue to explode in the sky. In summary, the video shows a nighttime street scene with fireworks exploding in the sky. The camera moves to follow the fireworks, showcasing different sections of the street.
Video:
What is the gorilla doing? Please give a detailed description.
VITA:
The gorilla is hanging from a wooden structure, swinging back and forth.
Video:
Please describe the dog's movements in detail.
VITA:
The dog is standing on its hind legs, with its front paws resting on the bars of a white metal gate. It is looking directly at the camera. The dog then lowers itself to all fours and walks away from the gate.
VITA Overview
The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, while also offering an advanced multimodal interactive experience. Our work distinguishes itself from existing open-source MLLMs through three key features:
Omni Multimodal Understanding. VITA demonstrates robust foundational capabilities in multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks.
Non-awakening Interaction. VITA can be activated and respond to user audio
questions in the environment without the need for a wake-up word or button.
Audio Interrupt Interaction. VITA is able to simultaneously track and filter
external queries in real-time. This allows users to interrupt the model's generation at any time with
new questions, and VITA will respond to the new query accordingly.
VITA is capable of processing inputs in the form of pure text/audio, as well as video/image combined with text/audio. In addition, two key techniques are adopted to advance the multimodal interactive experience:
State Token. We set different state tokens for different query inputs. <1> corresponds to an effective audio query, such as “what is the biggest animal in the world?”, for which we expect a response from the model. <2> corresponds to noisy audio, such as someone in the environment calling the user to eat, for which we expect the model not to reply. <3> corresponds to a text query, i.e., a question given by the user in text form. During the training phase, we teach the model to automatically distinguish these different input queries. During the deployment phase, <2> enables non-awakening interaction.
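As a minimal sketch of the state-token mechanism described above: the token strings <1>, <2>, and <3> come from the description, while the helper names (classify_state, generate) are assumptions made for this illustration, not VITA's actual API.

EFFECTIVE_AUDIO = "<1>"   # audio query that should be answered
NOISY_AUDIO     = "<2>"   # background audio that should be ignored
QUERY_TEXT      = "<3>"   # question given by the user in text form

def build_training_sample(query: str, state_token: str, answer: str) -> dict:
    # Prefix each query with its state token; noisy-audio samples get an empty
    # target so the model learns to stay silent on <2> inputs.
    target = "" if state_token == NOISY_AUDIO else answer
    return {"input": f"{state_token} {query}", "target": target}

def respond(model, audio_query: str):
    # Non-awakening interaction: only an effective audio query (<1>) triggers
    # a reply; noisy audio (<2>) is dropped without a wake-up word or button.
    state = model.classify_state(audio_query)        # hypothetical helper
    if state == NOISY_AUDIO:
        return None
    return model.generate(f"{state} {audio_query}")  # hypothetical helper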
Duplex Scheme. We further introduce a duplex scheme for the audio interrupt interaction. Two models run at the same time: the generation model is responsible for handling user queries, and while it is working, the other model monitors the environment. If the user interrupts with another effective audio query, the monitoring model aggregates the historical context to respond to the latest query, while the generation model pauses and switches to monitoring, i.e., the two models swap identities.
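To make the duplex scheme concrete, here is a rough Python sketch of how two model instances could swap the generation and monitoring roles. It assumes hypothetical stream_answer and listen_for_effective_audio interfaces and is not the released implementation.

import threading

class DuplexController:
    # Two model instances: one streams the answer to the current query while
    # the other monitors the audio stream; on an effective interrupt they swap.
    def __init__(self, model_a, model_b):
        self.generator, self.monitor = model_a, model_b
        self.history = []

    def run(self, query):
        while query is not None:
            self.history.append(query)
            stop = threading.Event()
            gen = threading.Thread(
                target=self.generator.stream_answer,  # assumed API: checks `stop` between tokens
                args=(self.history, stop),
            )
            gen.start()
            # The monitoring model listens while the answer is being generated and
            # returns an interrupting effective query, or None once generation ends.
            query = self.monitor.listen_for_effective_audio(until=gen.is_alive)  # assumed API
            if query is not None:
                stop.set()   # pause the current generation
            gen.join()
            # Swap identities: the former monitor answers the new query on the
            # next iteration; if there was no interrupt, the loop simply ends.
            self.generator, self.monitor = self.monitor, self.generator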
Experimental Results
Comparison of official Mixtral 8x7B Instruct and our trained Mixtral 8x7B.
Evaluation of Error Rate on ASR tasks.
Evaluation on image and video understanding.
Citation
@article{fu2024vita,
title={VITA: Towards Open-Source Interactive Omni Multimodal LLM},
author={Fu, Chaoyou and Lin, Haojia and Long, Zuwei and Shen, Yunhang and Zhao, Meng and Zhang, Yifan and Wang, Xiong and Yin, Di and Ma, Long and Zheng, Xiawu and He, Ran and Ji, Rongrong and Wu, Yunsheng and Shan, Caifeng and Sun, Xing},
journal={arXiv preprint arXiv:2408.05211},
year={2024}
}