Welcome to my academic homepage. I am a machine learning researcher. I received my Ph.D. in Computer Science from the University of California, Los Angeles (UCLA), and completed my undergraduate studies in Computer Science at Tsinghua University.
I work on multimodal learning for understanding, reasoning and skill learning. In particular, I'm interested in building models/agents that can learn from 2D/3D vision and text data, and perform a wide range of reasoning and embodied control tasks. Some of my research keywords can be found below:
- Multimodal learning: Vision and language, Visual reasoning, 3D vision, Generalist models
- Representation learning: Zero-shot and few-shot learning, Generative models
- Embodied agents: Reinforcement learning and imitation, Robotics, Sensor fusion
News
- [06/2025] Some of our recent work on robotic manipulation: Falcon (2-8x accelerated diffusion policy), AdaReP (MPC w/ 10x efficiency), ManiGen (synthetic manipulation data pump), COIN (benchmarking VLAs on both dexterity and reasoning)
- [01/2025] Some recent efforts on perception for embodied agents (long-form videos, dynamic scenes): Embodied VideoAgent, LongViTU
- [12/2024] Some recent efforts on multi-modal, open-world agents: ROCKET-1, Multi-modal Agent Tuning
- [10/2024] We will be hosting the NeurIPS 2024 Workshop on Open-World Agents. Come join us in Vancouver, BC, Canada this winter!
Selected Publications
Preprint
LongViTU: Instruction Tuning for Long-Form Video Understanding
arXiv preprint / arXiv / Project
Synthetic data for long-form video understanding.
Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting
arXiv preprint / arXiv / Project
🐭 RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation
arXiv preprint / arXiv / Project / Demo
Significant improvements on creative writing, code generation, and math problems.
HALMA: Humanlike Abstraction Learning Meets Affordance in Rapid Problem Solving
arXiv preprint / Paper / Click Here to Play HALMA! / arXiv
Conference and Journal
NEP: Autoregressive Image Editing via Next Editing Token Prediction
NeurIPS 2025 / arXiv / Project
Autoregressive text-to-image generation with native image editing and test-time scaling.
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
ICCV 2025 / arXiv / Project
A memory framework for embodied agents in household environments (and beyond!)
Spotlight presentation.
ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting
CVPR 2025 / arXiv / Project
Learning to interact with all your surroundings via self-supervision enables open-world agents :)
Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage
ICLR 2025 / arXiv / Project
Spotlight presentation.
JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models
T-PAMI / arXiv / Project / Code
Embodied RAG meets open-world agents.
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
NeurIPS 2024 D&B Track / arXiv / Demo / Project / Code
Free-form and region-based image editing made easy with language.
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
NeurIPS 2024 / arXiv / Project / Code
Native modeling of multimodal interaction (VLA) data.
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
ECCV 2024 / arXiv / Project
Zero-shot long-form video understanding, shrinking the gap to Gemini.
Unifying 3D Vision-Language Understanding via Promptable Queries
ECCV 2024 / arXiv / Project / Code
Unifying open-vocabulary perception and reasoning in the 3D world.
LEO: An Embodied Generalist Agent in 3D World
ICML 2024 / arXiv / Project / Code / Demo / Video
MindAgent: Emergent Gaming Interaction
NAACL 2024 Findings / Paper / arXiv / Project / Code
A benchmark and supporting infrastructure for LLMs in general multi-player gaming.
CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update
CVPR 2024 / arXiv / Project
A self-improving language agent that sharpens its tools.
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
ICLR 2024 / Paper / arXiv / Demo / Code
Ranked 1st on MMBench and MME in 08/2023.
Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World
ICLR 2024 / Paper / arXiv / Project / Code
The third chapter of the Bongard trilogy, for the LM era (chapter 1, chapter 2).
Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents
NeurIPS 2023 / Paper / arXiv / Project / Code
Best paper award, ICML-23 TEACH Workshop
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment
ICCV 2023 / arXiv / Project / Code
A new 3D-language foundation model.
Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction
CVPR 2023 / arXiv / Project / Code
SQA3D: Situated Question Answering in 3D Scenes
ICLR 2023 / Paper / arXiv / Slides / Project / Code / Benchmark
A new quest: embodied scene understanding.
Latent Diffusion Energy-Based Model for Interpretable Text Modeling
ICML 2022 / Paper / arXiv / Code
Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions
CVPR 2022 / Paper / Poster / Slides / Project / arXiv / Code / Bibtex
Oral presentation.
RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning
ICLR 2022 / Paper / Poster / Slides / Project / OpenReview / arXiv / Code / Bibtex
Unsupervised Foreground Extraction via Deep Region Competition
NeurIPS 2021 / Paper / arXiv / Code
Adversarial Option-Aware Hierarchical Imitation Learning
ICML 2021 / Paper / arXiv / Code
Spotlight presentation.
Theory-based Causal Transfer: Integrating Instance-level Induction and Abstract-level Structure Learning
AAAI 2020 / Paper / Project Page / arXiv / Code
Oral presentation.
Reinforcement Learning from Imperfect Demonstrations under Soft Expert Guidance
AAAI 2020 / Paper / Project Page / arXiv / Code
Also in the Structure & Priors in Reinforcement Learning Workshop at ICLR 2019.
Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement
NeurIPS 2019 / Paper / Project Page / arXiv / Code
Spotlight presentation.
Making Sense of Audio Vibration for Liquid Height Estimation in Robotic Pouring
IROS 2019 / Paper / Project Page / arXiv / Code / Video
PointNetGPD: Detecting Grasp Configurations from Point Sets
ICRA 2019 / Paper / Project Page / arXiv / Code / Video
Vision-based Teleoperation of Shadow Dexterous Hand using End-to-End Deep Neural Network
ICRA 2019 / Paper / Project Page / arXiv / Code / Video
Task Transfer by Preference-Based Cost Learning
AAAI 2019 / Paper / Project Page / arXiv / Code
Spotlight presentation.
Experience
- 2022.9 - 2023.1, Research Scientist Intern, DeepMind
  Mentors: Kory Mathewson, Peter Humphreys, Owen He, Adam Santoro, Timothy Lillicrap and Doina Precup.
- 2021.6 - 2021.12, Research Intern, NVIDIA Research
  Mentors: Weili Nie, Huaizu Jiang, Chaowei Xiao, Zhiding Yu, Yuke Zhu and Anima Anandkumar.
- 2020.6 - 2021.5, Research Intern, Google Brain Robotics
  Mentors: Pannag Sanketi and Laura Graesser.
- 2019.6 - 2019.9, Research Intern, ByteDance AI Lab
  Mentors: Tao Kong and Lei Li.
- 2017.7 - 2017.9, Research Intern, CVRP Lab, School of Computer Science, National University of Singapore
  Mentor: Gim Hee Lee.
© Xiaojian Ma 2024