¹S-Lab, NTU  ²Tencent  ³Tsinghua University  ⁴Nanjing University
* Equal Contribution ✉ Corresponding Author
📢 News
[04/2025] Insight-V is selected as a Highlight paper at CVPR 2025!
[02/2025] Insight-V is accepted to CVPR 2025!
[11/2024] 🔧🔨 Training & inference scripts released! Try Insight-V on your own!
[11/2024] 🔥🚀 Introducing Insight-V! An early attempt to explore long-chain visual reasoning with MLLMs.
[Paper]: A detailed introduction to Insight-V, including the structured, long-chain data generation pipeline and the effective multi-agent system design!
[Checkpoints]: We release model checkpoints based on LLaVA-NeXT-LLaMA3, together with our base model.
🚀 Introducing Insight-V
Main idea of Insight-V
Insight-V is an early effort to explore long-chain visual reasoning with MLLMs.
Insight-V offers 1) a scalable data generation pipeline for long-chain, high-quality reasoning data, 2) a multi-agent system that decomposes visual reasoning tasks into reasoning and summarization, and 3) a two-stage training pipeline to enhance visual reasoning capabilities. Together, these contributions address key challenges in visual reasoning, providing a solid foundation for future research in MLLM reasoning.
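The two-stage training pipeline is only outlined above, so here is a minimal Python sketch of how it might be orchestrated: stage 1 fine-tunes both agents from the same base model on the generated data, and stage 2 further optimizes the reasoning agent with preference learning. The function names, dataset paths, and the choice of DPO-style preference optimization are illustrative assumptions, not the repository's actual API.

```python
# Hypothetical sketch of the two-stage training pipeline; all names are illustrative.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    base_model: str = "llava-next-llama3-8b"        # assumed base MLLM
    reasoning_data: str = "insight_v_reasoning.jsonl"
    summary_data: str = "insight_v_summary.jsonl"
    preference_data: str = "insight_v_preference.jsonl"

def supervised_finetune(model_path: str, data_path: str) -> str:
    """Placeholder for standard SFT on image-question-response data."""
    print(f"SFT {model_path} on {data_path}")
    return model_path + "-sft"

def preference_optimize(model_path: str, data_path: str) -> str:
    """Placeholder for preference optimization (e.g. DPO) on chosen/rejected
    reasoning chains produced by the assessment system."""
    print(f"Preference-optimize {model_path} on {data_path}")
    return model_path + "-dpo"

def train(cfg: TrainConfig) -> tuple[str, str]:
    # Stage 1: derive both agents from the same base model via SFT.
    reasoning_agent = supervised_finetune(cfg.base_model, cfg.reasoning_data)
    summary_agent = supervised_finetune(cfg.base_model, cfg.summary_data)
    # Stage 2: sharpen the reasoning agent with preference learning.
    reasoning_agent = preference_optimize(reasoning_agent, cfg.preference_data)
    return reasoning_agent, summary_agent

if __name__ == "__main__":
    print(train(TrainConfig()))
```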
Overview of Data Generation Pipeline
The reasoning processes are generated progressively through a reasoning generator, and then fed into a multi-granularity assessment system to ensure high-quality reasoning.
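As a concrete illustration, here is a minimal Python sketch of this loop: a generator proposes reasoning steps one at a time until it decides to stop, and a multi-granularity assessor scores the finished chain before the sample is kept. The helper names (`generate_step`, `assess_chain`) and the scoring stub are hypothetical stand-ins for the actual model calls.

```python
# Hypothetical sketch of progressive reasoning generation plus assessment-based filtering.
import random

def generate_step(question: str, steps: list[str]) -> tuple[str, bool]:
    """Propose the next reasoning step; the flag says whether to stop."""
    step = f"step {len(steps) + 1} for: {question}"
    return step, len(steps) >= 3          # stop after a few steps in this stub

def assess_chain(question: str, steps: list[str], answer: str) -> float:
    """Stand-in for the multi-granularity assessor (step-level and
    response-level scoring); here it just returns a random score."""
    return random.random()

def build_sample(question: str, answer: str, threshold: float = 0.5):
    steps: list[str] = []
    done = False
    while not done:                        # progressive, step-by-step generation
        step, done = generate_step(question, steps)
        steps.append(step)
    score = assess_chain(question, steps, answer)
    if score < threshold:                  # discard low-quality chains
        return None
    return {"question": question, "steps": steps, "answer": answer}

print(build_sample("What is shown in the chart?", "a rising trend"))
```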
Overview of Multi-Agent System
We derive a multi-agent system from a single model. By decomposing the task into reasoning and summarization, the two agents collaborate to enhance the overall reasoning capability.
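A minimal sketch of this two-agent collaboration at inference time, assuming each agent is a separately fine-tuned copy of the same base MLLM: the reasoning agent produces a long chain of thought, and the summary agent conditions on the image, the question, and that chain to produce the final answer. The stubbed agents and prompt wording below are illustrative, not the released inference code.

```python
# Hypothetical sketch of the reasoning/summarization decomposition at inference time.

def reasoning_agent(image, question: str) -> str:
    """Stand-in for the reasoning agent: emits a long chain of thought."""
    return f"<long chain-of-thought about '{question}'>"

def summary_agent(image, question: str, reasoning: str) -> str:
    """Stand-in for the summary agent: answers concisely, grounded in the chain."""
    return f"<concise answer to '{question}' based on the reasoning>"

def answer(image, question: str) -> str:
    reasoning = reasoning_agent(image, question)        # agent 1: reason
    return summary_agent(image, question, reasoning)    # agent 2: summarize

print(answer(None, "How many people are in the image?"))
```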
✅ TODO List
Release paper on arXiv
Release Insight-V models.
Demo code for generation.
All the training and inference code.
Evaluation code for visual reasoning benchmarks.
Insight-V SFT Data.
Insight-V with stronger MLLMs.
📃 Main Results
Results on Visual Reasoning Benchmarks
Results on Other Image Benchmarks
Qualitative Results
Citation
If you find Insight-V useful for your research and applications, please cite our paper using this BibTeX:
@article{dong2024insight,
  title={Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models},
  author={Dong, Yuhao and Liu, Zuyan and Sun, Hai-Long and Yang, Jingkang and Hu, Winston and Rao, Yongming and Liu, Ziwei},
  journal={arXiv preprint arXiv:2411.14432},
  year={2024}
}