DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation
3 J.P. Morgan AI Research, 4 Carnegie Mellon University, 5 NVIDIA
*Indicates Equal Contribution
CoRL 2025 Best Paper Finalist
Abstract
We present DexUMI, a data collection and policy learning framework that uses the human hand as the natural interface to transfer dexterous manipulation skills to various robot hands. DexUMI includes hardware and software adaptations to minimize the embodiment gap between the human hand and various robot hands. The hardware adaptation bridges the kinematics gap using a wearable hand exoskeleton, which allows direct haptic feedback during manipulation data collection and adapts human motion to feasible robot hand motion. The software adaptation bridges the visual gap by replacing the human hand in video data with high-fidelity robot hand inpainting. We demonstrate DexUMI's capabilities through comprehensive real-world experiments on two different dexterous robot hand hardware platforms, achieving an average task success rate of 86%.
Introduction to DexUMI
Hardware Design
XHand exoskeleton
Inspire Hand exoskeleton
Capability Experiments
DexUMI experiment video. Please see the complete evaluations below.
Tea Picking with Tool
Task: Grasp tweezers from the table and use them to transfer tea leaves from a teapot to a cup. The main challenge is to stably and precisely operate the deformable tweezers with multi-finger contacts.
Hardware: XHand and Inspire Hand.
Ours (XHand)
Ours (Inspire)
Cube Picking
Task: Pick up a 2.5cm wide cube from a table and place it into a cup. This evaluates the basic capabilities and precision of the DexUMI system.
Ablation: We compare two forms of finger action trajectory: absolute position versus relative trajectory. Note that we always use relative positions for the wrist action (see the sketch after the videos below).
Hardware: Inspire Hand.
Ours
Absolute finger action trajectory
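To make the two parameterizations concrete, here is a minimal Python sketch; the array shapes, chunk length, and the helper name `finger_action_chunk` are our own illustrative assumptions rather than DexUMI's released code.

```python
import numpy as np

# Minimal sketch of the two finger-action parameterizations compared in this
# ablation. Shapes and names are illustrative assumptions. Wrist actions
# (always relative) would additionally require proper delta-pose computation,
# which is omitted here for brevity.

def finger_action_chunk(finger_pos, t, horizon, relative=True):
    """Build a finger action chunk of length `horizon` starting at step t.

    finger_pos: (T, J) array of absolute finger joint positions over a demo.
    relative=True  -> deltas w.r.t. the current joint positions ("Ours")
    relative=False -> absolute joint targets (ablation baseline)
    """
    future = finger_pos[t + 1 : t + 1 + horizon]
    return future - finger_pos[t] if relative else future

# Example: a 100-step demo with 12 finger joints.
demo = np.random.rand(100, 12)
relative_chunk = finger_action_chunk(demo, t=10, horizon=16)
absolute_chunk = finger_action_chunk(demo, t=10, horizon=16, relative=False)
```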
Kitchen Manipulation
Task: The task involves four sequential steps: turn off the stove knob; transfer the pan from the stove top to the counter; pick up salt from a container; and finally sprinkle it over the food in the pan. The task tests DexUMI's capability on long-horizon tasks requiring precise actions, tactile sensing, and skills beyond fingertip use (utilizing the sides of the fingers for stable pan handling).
Ablation: The wearable exoskeleton allows users to directly contact objects and receive haptic feedback. However, this human haptic feedback cannot be directly transferred to the robotic dexterous hand. Therefore, we install tactile sensors on the exoskeleton to capture and translate these tactile interactions. We compare policies trained with and without tactile sensor input (see the sketch after the videos below).
Hardware: XHand.
Ours
No Tactile Sensor
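As a rough illustration of how tactile readings can be consumed by a policy, here is a minimal PyTorch sketch; the feature dimensions, module structure, and names are assumptions and do not reflect the actual DexUMI policy architecture.

```python
import torch

class ObsEncoder(torch.nn.Module):
    """Toy observation encoder: fuses visual features with optional tactile input."""

    def __init__(self, visual_dim=512, tactile_dim=15, out_dim=256, use_tactile=True):
        super().__init__()
        self.use_tactile = use_tactile
        in_dim = visual_dim + (tactile_dim if use_tactile else 0)
        self.proj = torch.nn.Linear(in_dim, out_dim)

    def forward(self, visual_feat, tactile=None):
        # The "No Tactile Sensor" ablation corresponds to dropping this branch.
        x = torch.cat([visual_feat, tactile], dim=-1) if self.use_tactile else visual_feat
        return torch.relu(self.proj(x))

# Example with a batch of 8 observations.
enc = ObsEncoder()
obs = enc(torch.randn(8, 512), torch.randn(8, 15))
```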
Egg Carton
Task: Open an egg carton with multiple fingers: the hand must use the index, middle, ring, and little fingers to apply downward pressure on the carton's top while simultaneously using the thumb to lift the front latch. The task evaluates multi-finger coordination.
Ablation: DexUMI develops a software adaptation pipeline to bridge the visual gap between policy training and robot deployment. To test whether the software adaptation pipeline is crucial to our framework, we train the policy without software adaptation and instead replace pixels occupied by the exoskeleton (during training) or robot hand (during inference) with a green color mask (see the sketch after the videos below).
Hardware: Inspire Hand.
Ours
Without Software Adaptation
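For reference, the green-mask baseline amounts to overwriting the segmented pixels with a flat color rather than inpainting them; a minimal sketch follows, where the mask source (e.g., SAM2) and the exact color value are assumptions.

```python
import numpy as np

def apply_green_mask(frame, mask, color=(0, 255, 0)):
    """Overwrite masked pixels (exoskeleton during training, robot hand during
    inference) with a flat green color instead of inpainting them.

    frame: (H, W, 3) uint8 image; mask: (H, W) boolean array.
    """
    out = frame.copy()
    out[mask] = color
    return out

# Example on a dummy frame and mask.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
mask = np.zeros((480, 640), dtype=bool)
mask[100:200, 150:300] = True
masked_frame = apply_green_mask(frame, mask)
```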
Efficiency Comparison
DexUMI offers two key advantages over teleoperation: 1) DexUMI is significantly more efficient than traditional teleoperation methods, and 2) DexUMI provides direct haptic feedback, which typical teleoperation systems often fail to deliver.
Inpaint Results
We show the exoskeleton data and inpainted video side by side to demonstrate our software adaptation layer's capability. Our software adaptation bridges the visual gap by replacing the human hand and exoskeleton in the visual observations recorded by the wrist camera with high-fidelity robot hand inpainting (a high-level sketch of this pipeline follows the list below). Though the overall inpainting quality is good, we found that some deficits remain in the output, caused by:
- 1. Imperfect segmentation from SAM2: In most cases, we found that SAM2 (Ravi et al., 2024) segments the human hand and exoskeleton well. However, we notice that SAM2 sometimes misses small areas of the exoskeleton.
- 2. Quality of the inpainting method: We use the flow-based inpainting method ProPainter (Zhou et al., 2023) to replace the human and exoskeleton pixels with background pixels. Though the overall quality is high, some areas remain blurry.
- 3. Robot hand hardware: Throughout our experiments, we found that both the Inspire Hand and XHand lack sufficient precision due to backlash and friction. For example, the fingertip location of the Inspire Hand differs when moving from 1000 to 500 motor units compared to moving from 0 to 500 motor units. Consequently, when fitting regression models between encoder and hand motor values, we can typically ensure precision in only one direction, either when closing the hand or when opening it (see the calibration sketch after this list). This inevitably causes minor discrepancies in the inpainting and action mapping processes.
- 4. Inconsistent illumination: Similar to prior work (Chen et al., 2024), we found that the illumination on the robot hand during training might be inconsistent with what the robot experiences during deployment. Therefore, we add image augmentations, including color jitter and random grayscale, during policy training to make the learned policy less sensitive to lighting conditions (see the augmentation sketch after this list).
- 5. 3D-printed exoskeleton deformation: The human hand is powerful and can sometimes cause the 3D-printed exoskeleton to deform during operation. In such cases, the encoder value fails to reflect this deformation. Consequently, the robot finger location might not align with the exoskeleton's actual finger position.
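Below is a high-level Python sketch of the software adaptation layer described at the top of this section. The three stage functions are passed in as callables because the concrete tools (SAM2 for segmentation, ProPainter for inpainting, and a robot hand renderer) each have their own interfaces; the structure shown is our simplified reading of the pipeline, not the released implementation.

```python
import numpy as np

def adapt_video(frames, joint_values, segment_fn, inpaint_fn, render_fn):
    """Compose the three stages of the software adaptation layer, frame by frame.

    segment_fn(frame)       -> bool mask of human hand / exoskeleton pixels (e.g., SAM2)
    inpaint_fn(frame, mask) -> frame with masked pixels filled by background
                               (ProPainter actually operates on the whole video;
                               per-frame calls are shown here for simplicity)
    render_fn(q)            -> (robot_rgb, robot_mask) rendered at hand joint configuration q
    """
    adapted = []
    for frame, q in zip(frames, joint_values):
        mask = segment_fn(frame)              # 1. locate hand + exoskeleton pixels
        background = inpaint_fn(frame, mask)  # 2. fill in the background
        robot_rgb, robot_mask = render_fn(q)  # 3. overlay the rendered robot hand
        composite = background.copy()
        composite[robot_mask] = robot_rgb[robot_mask]
        adapted.append(composite)
    return np.stack(adapted)
```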
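The direction-dependent calibration in point 3 can be pictured with the following sketch, which fits a per-joint polynomial from exoskeleton encoder readings to robot hand motor values using only samples where the hand moves in one direction; the sweep procedure, sign convention, and polynomial degree are all assumptions on our part.

```python
import numpy as np

def fit_encoder_to_motor(encoder_vals, motor_vals, direction="closing", degree=3):
    """Fit a one-directional encoder -> motor mapping for a single joint.

    encoder_vals, motor_vals: 1D arrays of paired readings from a calibration sweep.
    Because of backlash and friction, samples from the opposite movement direction
    are discarded, so the fitted map is only trusted in `direction`.
    """
    delta = np.diff(encoder_vals, prepend=encoder_vals[0])
    keep = delta <= 0 if direction == "closing" else delta >= 0  # sign convention assumed
    coeffs = np.polyfit(encoder_vals[keep], motor_vals[keep], degree)
    return np.poly1d(coeffs)

# Example: synthetic closing sweep followed by an opening sweep with ~30 units of backlash.
enc = np.concatenate([np.linspace(1000, 0, 50), np.linspace(0, 1000, 50)])
mot = np.concatenate([np.linspace(0, 1000, 50), np.linspace(1000, 0, 50) + 30.0])
closing_map = fit_encoder_to_motor(enc, mot, direction="closing")
```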
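The augmentations in point 4 correspond to standard torchvision transforms; a minimal sketch follows, with parameter values chosen for illustration rather than taken from the paper.

```python
import torch
import torchvision.transforms as T

# Photometric augmentations applied to wrist-camera observations during policy
# training; the specific strengths and probability below are assumptions.
obs_augment = T.Compose([
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
    T.RandomGrayscale(p=0.1),
])

# Example on a random image tensor (C, H, W) in [0, 1].
image = torch.rand(3, 224, 224)
augmented = obs_augment(image)
```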
Cube Picking
Egg Carton
Tea Picking with Tool (Inspire Hand)
Tea Picking with Tool (XHand)
Kitchen Manipulation
References
- Ravi, N., Gabeur, V., Hu, Y. T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K. V., Carion, N., Wu, C. Y., Girshick, R., Dollár, P., & Feichtenhofer, C. (2024). SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:2408.00714.
- Zhou, S., Li, C., Chan, K. C. K., & Loy, C. C. (2023). ProPainter: Improving propagation and transformer for video inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10477-10486).
- Chen, L. Y., Xu, C., Dharmarajan, K., Cheng, R., Keutzer, K., Tomizuka, M., Vuong, Q., & Goldberg, K. (2024). RoVi-Aug: Robot and Viewpoint Augmentation for Cross-Embodiment Robot Learning. In 8th Annual Conference on Robot Learning.
BibTeX
@article{xu2025dexumi,
title={DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation},
author={Xu, Mengda and Zhang, Han and Hou, Yifan and Xu, Zhenjia and Fan, Linxi and Veloso, Manuela and Song, Shuran},
journal={arXiv preprint arXiv:2505.21864},
year={2025}
}