Abstract
Our approach involves three steps:
#1: Pre-training a world model on human videos,
#2: Finetuning the world model on unsupervised robot data, and
#3: Using the finetuned model to plan to achieve goals.
Step #1 - Pre-training a World Model on Human Videos
We use a shared human-robot high-level action space by leveraging affordances, which specify an interaction point and a post-contact trajectory, following our prior work. Our action space is flexible enough to also support actions outside this shared space.
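As a concrete illustration, below is a minimal Python sketch of what such a structured action might look like. The names (Affordance, to_low_level) and array shapes are our own illustrative assumptions, not the paper's released code.

from dataclasses import dataclass
import numpy as np

@dataclass
class Affordance:
    # High-level action shared between humans and robots:
    # where to make contact, and how to move after contact.
    contact_point: np.ndarray      # (3,) 3D interaction point in the scene
    post_contact_traj: np.ndarray  # (T, 3) end-effector waypoints after contact

def to_low_level(aff: Affordance) -> np.ndarray:
    # Flatten the structured action into a single vector for model input.
    return np.concatenate([aff.contact_point, aff.post_contact_traj.ravel()])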
Step #2 - Finetuning on unsupervised robot data
The robot samples from the affordance space to collect data for finetuning the world model. This data collection is unsupervised, since there is no task reward.
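To make the collection loop concrete, here is a hedged Python sketch of reward-free exploration. env, sample_affordance, and execute are assumed interfaces rather than the paper's API; the key point is that affordances are sampled and executed with no task reward.

def collect_unsupervised(env, sample_affordance, execute, num_episodes=100):
    # Reward-free data collection: sample an affordance, execute it,
    # and log the resulting trajectory for world-model finetuning.
    dataset = []
    for _ in range(num_episodes):
        obs = env.reset()
        aff = sample_affordance(obs)   # draw an action from the affordance space
        traj = execute(env, aff)       # roll it out on the robot; no reward signal
        dataset.append({"obs": obs, "action": aff, "trajectory": traj})
    return dataset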
Step #3 - Multi-Task Deployment
With the finetuned world model, we can solve tasks via planning. In our experiments, we specify tasks with goal images.
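One common way to plan with a learned world model is sampling-based model-predictive control; the sketch below assumes that setup and is not necessarily the paper's exact planner. model.rollout and encode are hypothetical interfaces: rollout imagines latent trajectories for candidate action sequences, and encode maps a goal image into the same latent space.

import numpy as np

def plan_to_goal(model, encode, state, goal_image, action_dim,
                 horizon=5, num_samples=256):
    # Score random action sequences by how close their imagined
    # final latent state lands to the encoded goal image.
    goal_latent = encode(goal_image)
    candidates = np.random.uniform(-1, 1, (num_samples, horizon, action_dim))
    costs = []
    for actions in candidates:
        pred = model.rollout(state, actions)                  # imagined latents, shape (horizon, d)
        costs.append(np.linalg.norm(pred[-1] - goal_latent))  # goal-reaching cost
    return candidates[int(np.argmin(costs))][0]               # first action of the best plan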
Effect of Pre-training
Pre-training on human videos significantly improves performance, especially when a single world model is used jointly across multiple tasks: average task success increases from 20% to 80%.
BibTeX
@inproceedings{mendonca2023structured,
  title={Structured World Models from Human Videos},
  author={Mendonca, Russell and Bahl, Shikhar and Pathak, Deepak},
  booktitle={Robotics: Science and Systems (RSS)},
  year={2023}
}
Acknowledgements
We thank Shagun Uppal and Murtaza Dalal for feedback on early drafts of this manuscript. This work is supported by the Sony Faculty Research Award and ONR N00014-22-1-2096.