Interpret human demonstration videos and generate robot action plans using a pipeline of keyframe selection, visual perception, and vision-language model reasoning.
Method
Module 1: Keyframe Selection
We use the MediaPipe API to detect hand keypoints and compute the hand's speed in each frame. The speed curve is then interpolated to be continuous, and its valleys (local minima) are selected as keyframes.
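As a concrete illustration, below is a minimal Python sketch of this step under stated assumptions: a single visible hand, the wrist landmark as a proxy for hand position, and cubic interpolation for the speed curve. The function name select_keyframes and the dense-grid resolution are our own illustrative choices, not the project's exact implementation.

```python
import cv2
import numpy as np
import mediapipe as mp
from scipy.interpolate import interp1d
from scipy.signal import find_peaks

def select_keyframes(video_path):
    """Return frame indices at valleys of the hand-speed curve (sketch)."""
    hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=1)
    cap = cv2.VideoCapture(video_path)
    wrist_xy, frame_ids = [], []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.multi_hand_landmarks:
            # Landmark 0 is the wrist; we use it as a proxy for hand position.
            lm = result.multi_hand_landmarks[0].landmark[0]
            wrist_xy.append((lm.x, lm.y))
            frame_ids.append(idx)
        idx += 1
    cap.release()
    hands.close()

    pts = np.asarray(wrist_xy)
    t = np.asarray(frame_ids, dtype=float)
    # Per-frame hand speed: displacement between consecutive detections.
    speed = np.linalg.norm(np.diff(pts, axis=0), axis=1) / np.diff(t)
    t_mid = (t[1:] + t[:-1]) / 2.0

    # Interpolate onto a dense grid so the speed curve is continuous
    # and its valleys are well defined (cubic interp needs >= 4 samples).
    dense_t = np.arange(t_mid[0], t_mid[-1], 0.25)
    dense_speed = interp1d(t_mid, speed, kind="cubic")(dense_t)

    # Valleys of the speed curve are peaks of its negation.
    valleys, _ = find_peaks(-dense_speed)
    return [int(round(dense_t[v])) for v in valleys]
```

Selecting speed valleys reflects the intuition that a hand briefly slows or pauses at moments of contact, such as grasping or releasing an object, which is where action boundaries tend to fall.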
Data and demos
We collected a dataset of human demonstration videos in three diverse categories: vegetable organization, garment organization, and wooden block stacking.
Below are the data and corresponding results.
Human demonstration
Here is a demonstration video for vegetable organization. The video illustrates how a human arranges the vegetable toys into specific containers one by one.
Robot execution
In this video, the robot executes the vegetable organization task in the same order as demonstrated by the human.
BibTeX
Coming Soon
Acknowledgements
This work was supported in part by NSF grants 2238968, 2322242, and 2024882, and by the NYU IT High Performance Computing resources, services, and staff expertise.