Peter Tong
Hi, I am Peter Tong; I also go by Shengbang Tong (童晟邦). I am a second-year CS PhD student at NYU Courant, advised by Professor Yann LeCun and Professor Saining Xie. I am funded by the OpenAI Superalignment Fellowship (2024-2025) and Meta (2025-2026). I graduated from UC Berkeley with a triple major in Computer Science, Applied Mathematics (Honors), and Statistics (Honors). I am from Nanjing, China and Melbourne, Australia.
Research
I am a second-year CS PhD student at NYU Courant, advised by Prof. Yann LeCun and Prof. Saining Xie. As an undergraduate at UC Berkeley, I was a researcher in the Berkeley Artificial Intelligence Research (BAIR) lab, advised by Prof. Yi Ma and Prof. Jacob Steinhardt. I am interested in world models, unsupervised/self-supervised learning, generative models, and multimodal models. I would like to thank all my mentors, Yubei, Xili, and Erik, and my collaborators for the incredible journey I had in my undergrad.
News
- 2025-06: Our papers Web-SSL and MetaMorph were accepted at ICCV 2025!
- 2025-05: I am re-joining FAIR as a research scientist intern with the amazing Koustuv Sinha!
- 2024-09: Our paper RLVLM was accepted at NeurIPS 2024, and Cambrian-1 was accepted at NeurIPS 2024 as an Oral Paper!
- 2024-05: I joined FAIR, Meta for a summer internship with Dr. Zhuang Liu, yayyyyy!
- 2024-04: I received the OpenAI Superalignment Fellowship! Thank you OpenAI!!! Looking forward to the cool work ahead.
- 2024-04: Our paper Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs was accepted at CVPR 2024 as an Oral Paper!
- 2024-01: Our paper Image Clustering in the Age of Pretrained Models was accepted at ICLR 2024!
- 2023-12: Our papers were accepted at CPAL 2024!
- 2023-09: I am helping organize the QVCV workshop at ICCV. See you all in Paris!
- 2023-09: Our papers MultiMon and CRATE (white-box transformer) were accepted at NeurIPS 2023!
- 2023-07: Our paper Manifold Linearizing and Clustering was accepted at ICCV 2023!
- 2023-05: I graduated from UC Berkeley with a triple degree in Applied Math (Honors), Statistics (Honors), and Computer Science (no honors, because I didn't want to take 16B and 61C too early, but I published quite a few interesting works, so yay)!!!
- 2023-04: I will be a CS PhD student at NYU Courant, advised by Professor Yann LeCun and Professor Saining Xie. Looking forward to working with Yann and Saining in New York!
- 2023-01: Our paper incremental-CTRL was accepted at ICLR 2023!
Publications
Scaling Language-Free Visual Representation Learning
We introduce Visual SSL 2.0: scaling models and data to the billion scale and adding VQA to the evaluation suite. Vision-only models scale with model size and data size, eventually catching up to or surpassing CLIP models.
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
Visual understanding and visual generation are mutually beneficial in unified models! But visual understanding data is much more effective than visual generation data. LLM capabilities, such as implicit reasoning, can also transfer to unified models!
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
We provide a vision-centric exploration, or cookbook, for MLLMs, systematically studying visual representations, vision-language connectors, instruction-tuning data, training recipes, and evaluation protocols. We propose new vision-centric benchmarks and a spatial-aware connector, collect and curate instruction data, and release highly competitive 8B, 13B, and 34B models on par with GPT-4V and Gemini.
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Is vision good enough for language? Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. We identify 'CLIP-blind pairs' - images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark.
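A minimal sketch of how such pairs can be mined, assuming the Hugging Face transformers library: compare CLIP image embeddings against a vision-only DINOv2 reference and keep pairs that CLIP conflates but DINOv2 separates. The checkpoints and the 0.95/0.6 thresholds below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of mining "CLIP-blind pairs": keep image pairs that CLIP embeds
# almost identically but that a vision-only encoder (DINOv2 here) tells apart.
# Checkpoints and thresholds are illustrative, not the paper's exact setup.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
dino = AutoModel.from_pretrained("facebook/dinov2-base").eval()
dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

@torch.no_grad()
def is_clip_blind(img_a: Image.Image, img_b: Image.Image) -> bool:
    # CLIP image embeddings: high cosine similarity means CLIP "sees"
    # the two images as nearly the same.
    ca = clip.get_image_features(**clip_proc(images=img_a, return_tensors="pt"))
    cb = clip.get_image_features(**clip_proc(images=img_b, return_tensors="pt"))
    # DINOv2 [CLS] features: a vision-only reference that should still
    # separate visually distinct images.
    da = dino(**dino_proc(images=img_a, return_tensors="pt")).last_hidden_state[:, 0]
    db = dino(**dino_proc(images=img_b, return_tensors="pt")).last_hidden_state[:, 0]
    sim = lambda x, y: torch.nn.functional.cosine_similarity(x, y).item()
    return sim(ca, cb) > 0.95 and sim(da, db) < 0.6
```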
Mass-Producing Failures of Multimodal Systems with Language Models
Deployed multimodal systems can fail in ways that evaluators did not anticipate. To find these failures before deployment, we introduce MultiMon, a system that automatically identifies systematic failures.
From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models
Irving Fang*, Juexiao Zhang*, Shengbang Tong, Chen Feng
Technical Report
Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
Junhong Shen*, Hao Bai*, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar
Technical Report
Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
Chun-Hsiao Yeh*, Chenyu Wang*, Shengbang Tong, Ta-Ying Cheng, Rouyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, Yi Ma
Technical Report
Scaling Language-Free Visual Representation Learning
David Fan*, Shengbang Tong*, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar†, Saining Xie†
ICCV 2025 Highlight
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Tianzhe Chu*, Yuexiang Zhai*, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma
ICML 2025
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, Zhuang Liu
ICCV 2025
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, Graham Neubig
ACL 2025
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Shengbang Tong*, Ellis Brown*, Penghao Wu*, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, Saining Xie
NeurIPS 2024 Oral
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
Yuexiang Zhai, Hao Bai*, Zipeng Lin*, Jiayi Pan*, Shengbang Tong*, Yifei Zhou*, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine
NeurIPS 2024
Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning
Shentong Mo, Shengbang Tong
NeurIPS 2024 Spotlight
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie
CVPR 2024 Oral
Investigating the Catastrophic Forgetting in Multimodal Large Language Models
Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma
CPAL 2024
Emergence of Segmentation with Minimalistic White-Box Transformers
Yaodong Yu*, Tianzhe Chu*, Shengbang Tong, Ziyang Wu, Druv Pai, Sam Buchanan, Yi Ma
CPAL 2024
Ctrl123: Consistent Novel View Synthesis via Closed-Loop Transcription
Hongxiang Zhao*, Xili Dai*, Jianan Wang, Shengbang Tong, Jingyuan Zhang, Weida Wang, Lei Zhang, Yi Ma
Technical Report
Mass-Producing Failures of Multimodal Systems with Language Models
Shengbang Tong*, Erik Jones*, Jacob Steinhardt
NeurIPS 2023
Image Clustering in the Age of Pretrained Models
Tianzhe Chu*, Shengbang Tong*, Tianjiao Ding*, Xili Dai, Benjamin Haeffele, Rene Vidal, Yi Ma
ICLR 2024
White-Box Transformers via Sparse Rate Reduction
Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Benjamin Haeffele, Yi Ma
NeurIPS 2023
EMP-SSL: Towards Self-Supervised Learning in One Epoch
Shengbang Tong*, Yubei Chen*, Yi Ma, Yann LeCun
Technical Report
Unsupervised Manifold Linearizing and Clustering
Tianjiao Ding, Shengbang Tong, Kwan Ho Ryan Chan, Xili Dai, Yi Ma, Benjamin David Haeffele
ICCV 2023
Closed-Loop Transcription Via Convolutional Sparse Coding
Xili Dai, Ke Chen, Shengbang Tong, Jingyuan Zhang, Xingjian Gao, Mingyang Li, Druv Pai, Yuexiang Zhai, Xiaojun Yuan, Heung-Yeung Shum, Lionel M. Ni, Yi Ma
CPAL 2024
Unsupervised Learning of Structured Representation via Closed-Loop Transcription
Shengbang Tong*, Xili Dai*, Yubei Chen, Mingyang Li, Zengyi Li, Brent Yi, Yann LeCun, Yi Ma
CPAL 2024
Revisiting Sparse Convolutional Model for Visual Recognition
Xili Dai*, Mingyang Li*, Pengyuan Zhai, Shengbang Tong, Xingjian Gao, Shaolun Huang, Zhihui Zhu, Chong You, Yi Ma
NeurIPS 2022
Incremental Learning of Structured Memory via Closed-Loop Transcription
Shengbang Tong, Xili Dai, Ziyang Wu, Mingyang Li, Brent Yi, Yi Ma
ICLR 2023
Closed-Loop Data Transcription to an LDR via Minimaxing Rate Reduction
Xili Dai*, Shengbang Tong*, Mingyang Li*, Ziyang Wu*, Kwan Ho Ryan Chan, Pengyuan Zhai, Yaodong Yu, Michael Psenka, Xiaojun Yuan, Heung-Yeung Shum, Yi Ma
Entropy Journal