FetchBot: Learning Generalizable Object Fetching in Cluttered Scenes via Zero-Shot Sim2Real
Conference on Robot Learning 2025 (Oral Presentation)
Weiheng Liu Yuxuan Wan Jilong Wang Yuxuan Kuang Wenbo Cui Xuesong Shi Haoran Li Dongbin Zhao Zhizheng Zhang † He Wang †
Overview
FetchBot is a sim-to-real framework for generalizable object fetching in cluttered scenes. Its systematic design enables policy generalization and zero-shot sim-to-real transfer across diverse objects, varying layouts, and multiple end-effectors.
Summary Video
FetchBot
Method Pipeline
FetchBot Pipeline. In the (A) data generation stage, we use UniVoxGen to generate a diverse set of cluttered scenes and employ an RL-based oracle policy to collect representative demonstrations. In the (B) 3D vision encoder pretraining stage, we first use the foundation model's predicted depth as an intermediate representation to mitigate the sim-to-real gap, then introduce an occupancy prediction task to learn a complete scene representation that can infer occluded regions. In the (C) vision policy training stage, we distill these expert demonstrations into a vision-based policy through imitation learning, enabling (D) zero-shot sim-to-real transfer.
Voxel-based Cluttered Scene Generator
UniVoxGen efficiently generates realistic cluttered scenes by voxelizing objects and applying lightweight voxel operations: Union, Intersection, Difference, and Transformation. This enables the creation of a large-scale dataset of one million scenes, supporting oracle policy training and providing dense ground truth for occupancy prediction.
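The four voxel operations above can be sketched as boolean operations on occupancy grids. This is a minimal illustrative sketch, not UniVoxGen's actual implementation; function names, the integer-voxel translation, and the grid sizes are all assumptions.

```python
import numpy as np

def voxel_union(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Occupied where either grid is occupied (merge objects into one scene)."""
    return a | b

def voxel_intersection(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Occupied only where both grids overlap (overlap/collision checks)."""
    return a & b

def voxel_difference(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Carve grid b out of grid a (e.g. hollow out a shelf interior)."""
    return a & ~b

def voxel_translate(a: np.ndarray, shift: tuple) -> np.ndarray:
    """Integer-voxel transformation: shift a grid along (x, y, z)."""
    # Assumes the occupied region stays in bounds, so wrap-around is harmless.
    return np.roll(a, shift, axis=(0, 1, 2))

# Usage: place two boxes on a 32^3 grid and verify the layout is overlap-free.
grid_a = np.zeros((32, 32, 32), dtype=bool)
grid_a[4:10, 4:10, 0:6] = True                    # a 6x6x6 box
grid_b = voxel_translate(grid_a, (12, 0, 0))      # a second, shifted box
scene = voxel_union(grid_a, grid_b)
assert not voxel_intersection(grid_a, grid_b).any()
```

Because each operation is a vectorized boolean op, composing many objects into a scene stays cheap, which is what makes generating a million layouts tractable.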
Generated cluttered scenes by UniVoxGen, including shelf, tabletop, drawer, and storage rack environments.
Dynamics-Aware Oracle Policy
Scene Encoder Network.
The oracle policy leverages a hierarchical scene encoder to capture both local object details and global scene context, enabling robust representation of cluttered environments. Combined with a reward design that balances task success and minimal disturbance, it facilitates precise and efficient demonstration generation.
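The reward trade-off described above can be sketched as follows. The specific terms, weights, and state fields here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def oracle_reward(ee_pos, target_pos, obstacle_disp, success,
                  w_dist=1.0, w_disturb=0.5, r_success=10.0):
    """Reward = progress toward the target minus a penalty for disturbing
    surrounding objects, plus a sparse bonus on task success."""
    dist = np.linalg.norm(ee_pos - target_pos)   # end-effector-to-target gap
    disturbance = float(np.sum(obstacle_disp))   # total motion of other objects
    reward = -w_dist * dist - w_disturb * disturbance
    if success:
        reward += r_success                      # sparse success bonus
    return reward

# Usage: a step that nears the target while barely moving obstacles scores
# higher than one that knocks neighboring objects aside.
careful = oracle_reward(np.array([0.1, 0.0, 0.0]), np.zeros(3),
                        obstacle_disp=np.array([0.01]), success=False)
reckless = oracle_reward(np.array([0.1, 0.0, 0.0]), np.zeros(3),
                         obstacle_disp=np.array([0.30]), success=False)
assert careful > reckless
```

Weighting disturbance against progress is what pushes the oracle toward demonstrations that fetch the target without toppling its neighbors.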
Vision-based Imitation Learning
Vision-based imitation learning employs depth-based intermediate representations for robust sim-to-real transfer, ensuring consistent action predictions. To address occlusion, it integrates semantic occupancy prediction with multi-view depth features via deformable cross-attention, enabling efficient and generalizable scene understanding. The resulting 3D scene representation is transformed into executable actions through diffusion.
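The multi-view fusion step above can be sketched with standard cross-attention, substituted here for the deformable variant the paper uses. All module names, shapes, and the query layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DepthFeatureFusion(nn.Module):
    """Fuse learned 3D voxel queries with flattened multi-view depth features
    via cross-attention (a stand-in for deformable cross-attention)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, voxel_queries, view_feats):
        # voxel_queries: (B, N_voxels, dim) -- one query per 3D scene location
        # view_feats:    (B, N_views * HW, dim) -- depth features from all views
        fused, _ = self.attn(voxel_queries, view_feats, view_feats)
        return fused  # each voxel query now aggregates evidence from all views

fusion = DepthFeatureFusion()
queries = torch.randn(1, 512, 64)      # e.g. an 8x8x8 grid of voxel queries
views = torch.randn(1, 2 * 196, 64)    # two views of 14x14 depth patches
scene_repr = fusion(queries, views)    # (1, 512, 64) 3D scene representation
```

Attending from fixed 3D queries to all camera views is what lets the representation reason about regions occluded in any single view; the deformable variant additionally restricts each query to a sparse set of sampled reference points for efficiency.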
Simulation Results
Parallel Simulation
Diverse Scenes
Occupancy Reconstruction (Top-Down View)
(A) RGB-D voxel method misses crucial geometric details due to occlusion, leading to collision during fetching. (B) Our method can infer the occluded region, enabling successful collision avoidance.
Real World Results
Comparison With Baselines
Our approach achieves strong sim-to-real performance, outperforming methods that rely on RGB (DP) or PointCloud (DP3) representations.
Generalize to Diverse Obstacles
Generalize to Diverse Environments
Storage Rack-1
Storage Rack-2
Cabinet-1
Cabinet-2
Dynamic Cases
Dynamic Scenario-1
Dynamic Scenario-2
Broader Application
Tabletop Scenario
Drawer Scenario
Our method can also be extended to other tasks. Real-robot extension experiments show successful fetching in cluttered tabletop (suction) and drawer (parallel gripper) settings.
Occupancy Prediction Results
Real-world occupancy reconstruction results, capable of handling diverse scenarios: varying object shapes, different materials, and complex layouts.
Occupancy vs. RGB-D
Comparison in the real world. Direct RGB-D voxelization often yields incomplete scenes in real-world settings, while our occupancy method (Occ) generates more complete representations.
Note: we do not feed the reconstructed voxels (the final occupancy map) to the downstream policy. Instead, the policy receives intermediate latent features from the pre-trained encoder; although it never observes the final occupancy prediction, these features, enriched by the auxiliary occupancy task, allow it to implicitly capture the scene's complete 3D geometry.
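The note above can be sketched as follows: the occupancy head exists only for auxiliary pretraining, while the policy consumes the encoder's latent features. The module structure, channel counts, and grid size are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SceneEncoder(nn.Module):
    def __init__(self, in_ch=1, feat_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Auxiliary head: supervised with per-voxel occupancy during
        # pretraining, then unused at policy-training and deployment time.
        self.occ_head = nn.Conv3d(feat_dim, 1, 1)

    def forward(self, x, return_occ=False):
        feats = self.backbone(x)      # latent features: what the policy sees
        if return_occ:                # pretraining path only
            return feats, torch.sigmoid(self.occ_head(feats))
        return feats

enc = SceneEncoder()
depth_voxels = torch.zeros(1, 1, 32, 32, 32)   # voxelized depth input
latent = enc(depth_voxels)                     # policy input: (1, 64, 8, 8, 8)
```

The occupancy loss shapes `feats` to encode complete 3D geometry, so the policy benefits from the auxiliary task without ever decoding the occupancy map itself.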
Additional Demos with Corresponding Observations
Our Team
2CFCS, School of Computer Science, Peking University,
3Galbot, 4Beijing Academy of Artificial Intelligence
* Equal contributions † Corresponding authors
@misc{liu2025fetchbotlearninggeneralizableobject,
title={FetchBot: Learning Generalizable Object Fetching in Cluttered Scenes via Zero-Shot Sim2Real},
author={Weiheng Liu and Yuxuan Wan and Jilong Wang and Yuxuan Kuang and Wenbo Cui and Xuesong Shi and Haoran Li and Dongbin Zhao and Zhizheng Zhang and He Wang},
year={2025},
eprint={2502.17894},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2502.17894},
}
If you have any questions, please contact Weiheng Liu and Yuxuan Wan.