I currently focus on developing cross-embodiment robot intelligence—agents that learn from the collective experience of diverse robots, transfer to new hardware, and continually improve over time. This holds the potential to enable foundation models that adapt to worn motors, customized hardware, or even home-built robots without precise kinematic models. I study these questions through both model-based and model-free methods across tasks from locomotion to dexterous manipulation. In earlier work, I have explored multimodal perception, mobile manipulation, world model learning, and robot navigation, and I continue to have broad interests in these areas.
Three papers accepted to CoRL 2025: Embodiment Scaling Laws, Diffusion Dynamics Models, and SAVOR. If you are interested in cross-embodiment learning, world models, or affordance learning, feel free to check them out!
Dynamics models that predict the effects of physical interactions are essential for planning and control in robotic manipulation. Although models based on physical principles often generalize well, they typically require full-state information, which can be difficult or impossible to extract from perception data in complex, real-world scenarios. Learning-based dynamics models provide an alternative: they derive state transition functions directly from perceived interaction data, capture complex, hard-to-model factors and predictive uncertainty, and accelerate simulations that are often too slow for real-time control. Recent successes in this field have demonstrated notable advances in robot capabilities, including long-horizon manipulation of deformable objects, granular materials, and complex multi-object interactions such as stowing and packing. A crucial aspect of these investigations is the choice of state representation, which determines the inductive biases available to the learning system for reduced-order modeling of scene dynamics. This article provides a timely and comprehensive review of current techniques and trade-offs in designing learned dynamics models, highlighting their role in advancing robot capabilities through integration with state estimation and control, and identifying critical research gaps for future exploration. Dynamics models learned from real-world interactions, with task-aligned state representations, empower robotic manipulation.
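To make the role of such models in control concrete, here is a minimal sketch of random-shooting model-predictive control wrapped around a generic learned transition function. The `dynamics_model` and `task_cost` callables are hypothetical placeholders, not an interface from any specific system discussed in the article.

```python
import numpy as np

def random_shooting_mpc(dynamics_model, task_cost, state, horizon=10,
                        num_samples=256, action_dim=4, action_scale=1.0):
    """Pick the first action of the lowest-cost sampled action sequence.

    dynamics_model(state, action) -> next_state  (the learned transition function)
    task_cost(state) -> scalar cost              (task-specific objective)
    """
    best_cost, best_action = np.inf, None
    for _ in range(num_samples):
        # Sample a candidate action sequence.
        actions = action_scale * np.random.uniform(-1, 1, size=(horizon, action_dim))
        s, cost = state, 0.0
        for a in actions:
            s = dynamics_model(s, a)   # roll the learned model forward
            cost += task_cost(s)
        if cost < best_cost:
            best_cost, best_action = cost, actions[0]
    return best_action  # execute, observe, and re-plan at the next step
```

Sampling-based planners of this kind are a common way to exploit a learned model without requiring its gradients, which is one reason learned dynamics models pair naturally with model-predictive control.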
Towards Embodiment Scaling Laws in Robot Locomotion
Cross-embodiment generalization underpins the vision of building generalist embodied agents for any robot, yet its enabling factors remain poorly understood. We investigate embodiment scaling laws, the hypothesis that increasing the number of training embodiments improves generalization to unseen ones, using robot locomotion as a test bed. We procedurally generate approximately 1,000 embodiments with topological, geometric, and joint-level kinematic variations, and train policies on random subsets. We observe positive scaling trends supporting the hypothesis, and find that embodiment scaling enables substantially broader generalization than data scaling on fixed embodiments. Our best policy, trained on the full dataset, transfers zero-shot to novel embodiments in simulation and the real world, including the Unitree Go2 and H1. These results represent a step toward general embodied intelligence, with relevance to adaptive control for configurable robots, morphology–control co-design, and beyond.
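The experimental protocol implied by the abstract can be summarized in a short, hypothetical sketch: train a shared policy on progressively larger random subsets of the generated embodiments and evaluate it zero-shot on held-out ones. The helpers `train_policy` and `evaluate` are placeholders, not the paper's actual code.

```python
import random

def embodiment_scaling_curve(all_embodiments, subset_sizes, train_policy, evaluate,
                             num_heldout=50, seed=0):
    """Hypothetical protocol: train on N embodiments, test zero-shot on held-out ones."""
    rng = random.Random(seed)
    embodiments = list(all_embodiments)
    rng.shuffle(embodiments)
    heldout, pool = embodiments[:num_heldout], embodiments[num_heldout:]

    curve = []
    for n in subset_sizes:                      # e.g., [1, 10, 100, len(pool)]
        train_set = rng.sample(pool, n)
        policy = train_policy(train_set)        # one policy shared across embodiments
        score = sum(evaluate(policy, e) for e in heldout) / len(heldout)
        curve.append((n, score))                # a rising curve supports the hypothesis
    return curve
```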
RoboPack: Learning Tactile-Informed Dynamics Models for Dense Packing
Tactile feedback is critical for understanding the dynamics of both rigid and deformable objects in many manipulation tasks, such as non-prehensile manipulation and dense packing. We introduce an approach that combines visual and tactile sensing for robotic manipulation by learning a neural, tactile-informed dynamics model. Our proposed framework, RoboPack, employs a recurrent graph neural network to estimate object states, including particles and object-level latent physics information, from historical visuo-tactile observations and to perform future state predictions. Our tactile-informed dynamics model, learned from real-world data, can solve downstream robotics tasks with model-predictive control. We demonstrate our approach on non-prehensile manipulation and dense packing tasks with a real robot equipped with a compliant Soft-Bubble tactile sensor, where the robot must infer the physical properties of objects from direct and indirect interactions. Trained on an average of only 30 minutes of real-world interaction data per task, our model can perform online adaptation and make touch-informed predictions. Through extensive evaluations on both long-horizon dynamics prediction and real-world manipulation, our method demonstrates superior effectiveness compared to previous learning-based and physics-based simulation systems.
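The sketch below illustrates the two roles the abstract describes, state estimation from visuo-tactile history and forward rollout for planning, as one plausible interface. The class and module names are hypothetical and do not reflect the released RoboPack implementation.

```python
from dataclasses import dataclass

@dataclass
class SceneState:
    particles: list[tuple[float, float, float]]  # object geometry as particles
    physics_latent: list[float]                   # per-object latent physics code

class TactileInformedModel:
    """Hypothetical interface mirroring the two roles described above."""

    def __init__(self, estimator, predictor):
        self.estimator = estimator    # recurrent model over visuo-tactile history
        self.predictor = predictor    # dynamics model over particles + latents

    def estimate(self, visuo_tactile_history) -> SceneState:
        # Infer particles and latent physics from past observations,
        # including indirect contacts sensed through the tactile signal.
        return self.estimator(visuo_tactile_history)

    def rollout(self, state: SceneState, actions) -> list[SceneState]:
        # Predict future states for a candidate action sequence; a planner
        # scores these rollouts with a task cost (e.g., packing success).
        states = []
        for a in actions:
            state = self.predictor(state, a)
            states.append(state)
        return states
```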
@inproceedings{ai2024robopack,
  title     = {RoboPack: Learning Tactile-Informed Dynamics Models for Dense Packing},
  author    = {Ai*, Bo and Tian*, Stephen and Shi, Haochen and Wang, Yixuan and Tan, Cheston and Li, Yunzhu and Wu, Jiajun},
  booktitle = {Robotics: Science and Systems (RSS)},
  year      = {2024},
  url       = {https://arxiv.org/abs/2407.01418},
  note      = {Abridged in the ICRA 2024 workshops ViTac (https://shanluo.github.io/ViTacWorkshops/), 3DVRM (https://3d-manipulation-workshop.github.io/), and Future Roadmap for Sensorimotor Skills (https://icra-manipulation-skill.github.io/), and the RSS 2024 workshop Priors4Robots (https://sites.google.com/alora.tech/priors4robots24).}
}
Invariance is Key to Generalization: Examining the Role of Representation in Sim-to-Real Transfer for Visual Navigation
The data-driven approach to robot control has been gathering pace rapidly, yet generalization to unseen task domains remains a critical challenge. We argue that the key to generalization is representations that are (i) rich enough to capture all task-relevant information and (ii) invariant to superfluous variability between the training and the test domains. We experimentally study such a representation—containing both depth and semantic information—for visual navigation and show that it enables a control policy trained entirely in simulated indoor scenes to generalize to diverse real-world environments, both indoors and outdoors. Further, we show that our representation reduces the A-distance between the training and test domains, improving the generalization error bound as a result. Our proposed approach is scalable: the learned policy improves continuously, as the foundation models that it exploits absorb more diverse data during pre-training.
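As a rough illustration of the idea, the sketch below composes depth and semantic segmentation from off-the-shelf perception models into a single policy input, discarding appearance details that vary between simulation and the real world. `depth_model` and `segmentation_model` are placeholders for whatever pre-trained models are plugged in; the exact representation in the paper may differ.

```python
import numpy as np

def invariant_representation(rgb_image, depth_model, segmentation_model,
                             num_classes=20):
    """Build a policy input that is rich (depth + semantics) yet invariant to
    appearance details (texture, lighting) that differ between domains.

    depth_model(rgb) -> (H, W) depth map
    segmentation_model(rgb) -> (H, W) integer class labels
    Both are placeholders for off-the-shelf perception / foundation models.
    """
    depth = depth_model(rgb_image)                              # (H, W)
    labels = segmentation_model(rgb_image)                      # (H, W) ints in [0, num_classes)
    semantics = np.eye(num_classes, dtype=np.float32)[labels]   # one-hot, (H, W, C)
    # Stack into a single (H, W, C+1) array consumed by the control policy.
    return np.concatenate([depth[..., None].astype(np.float32), semantics], axis=-1)
```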
Deep Visual Navigation under Partial Observability
How can a robot navigate successfully in rich and diverse environments, indoors or outdoors, along office corridors or grassland trails, on flat ground or up a staircase? To do so, this work addresses three challenges: (i) complex visual observations, (ii) partial observability of local visual sensing, and (iii) multimodal robot behaviors conditioned on both the local environment and the global navigation objective. We propose to train a neural network (NN) controller for local navigation via imitation learning. To handle complex visual observations, we extract multi-scale spatial representations with CNNs. To handle partial observability, we aggregate multi-scale spatial information over time and encode it with LSTMs. To learn multimodal behaviors, we use a separate memory module for each behavior mode. Importantly, we integrate these neural network modules into a unified controller that achieves robust performance for visual navigation in complex, partially observable environments. We implemented the controller on the quadrupedal Spot robot and evaluated it on three challenging tasks: adversarial pedestrian avoidance, blind-spot obstacle avoidance, and elevator riding. The experiments show that the proposed NN architecture significantly improves navigation performance.
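A compact PyTorch sketch of the described design, multi-scale CNN features, a separate LSTM memory per behavior mode, and a shared action head, is given below. Layer sizes and module names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ModalNavController(nn.Module):
    """Illustrative sketch (not the paper's code): multi-scale CNN features
    -> one LSTM memory per behavior mode -> action head."""

    def __init__(self, num_modes=3, action_dim=2, feat_dim=128):
        super().__init__()
        # Multi-scale spatial features: two conv stages tapped at different resolutions.
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(4)
        self.project = nn.Linear((32 + 64) * 4 * 4, feat_dim)
        # A separate recurrent memory module per behavior mode.
        self.memories = nn.ModuleList(
            [nn.LSTM(feat_dim, feat_dim, batch_first=True) for _ in range(num_modes)]
        )
        self.head = nn.Linear(feat_dim, action_dim)

    def forward(self, frames, mode):
        # frames: (B, T, 3, H, W) image sequence; mode: index of the active behavior.
        b, t = frames.shape[:2]
        x = frames.flatten(0, 1)                      # (B*T, 3, H, W)
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        feats = torch.cat([self.pool(f1).flatten(1), self.pool(f2).flatten(1)], dim=-1)
        feats = self.project(feats).view(b, t, -1)    # (B, T, feat_dim)
        out, _ = self.memories[mode](feats)           # aggregate over time
        return self.head(out[:, -1])                  # action from the last time step
```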
@inproceedings{ai2022deep,
  title     = {Deep Visual Navigation under Partial Observability},
  author    = {Ai, Bo and Gao, Wei and Vinay and Hsu, David},
  booktitle = {International Conference on Robotics and Automation (ICRA)},
  pages     = {9439--9446},
  publisher = {IEEE},
  year      = {2022},
  url       = {https://doi.org/10.1109/ICRA46639.2022.9811598},
  doi       = {10.1109/ICRA46639.2022.9811598}
}
"We shall not cease from exploration. And the end of all our exploring will be to arrive where we started, and know the place for the first time." — T. S. Eliot