Solving New Tasks by Adapting Internet Video Knowledge
ICLR 2025
Abstract
Video generative models demonstrate great promise in robotics by serving as visual planners or as policy supervisors. When pretrained on internet-scale data, such video models become intimately aligned with natural language, and can thus facilitate generalization to novel downstream behavior through text-conditioning. However, they may not be sensitive to the specificities of the particular environment the agent inhabits. On the other hand, training video models on in-domain examples of robotic behavior naturally encodes environment-specific intricacies, but the scale of available demonstrations may not be sufficient to support generalization to unseen tasks via natural language specification. In this work, we investigate different adaptation techniques that integrate in-domain information with large-scale pretrained video models, and explore the extent to which they enable novel text-conditioned generalization for robotic tasks, while also weighing their respective data and resource requirements. We demonstrate across robotic environments that adapting powerful video models with small amounts of example data can successfully facilitate generalization to novel behaviors. In particular, we present a novel adaptation strategy, termed Inverse Probabilistic Adaptation, that not only consistently achieves strong generalization performance across robotic tasks and settings, but also exhibits robustness to the quality of adaptation data, successfully solving novel tasks even when only suboptimal in-domain demonstrations are available.
Adaptation Techniques
We explore three different adaptation techniques: Subject Customization, Probabilistic Adaptation, and Direct Finetuning. Subject Customization only modifies the image and text encoder, rather than the motion module, and is lightweight in terms of data requirements: it only utilizes pairs of static images and text annotated with a special identifier. Probabilistic Adaptation learns a small in-domain model from paired video data, which is then used through score composition with a large-scale video model that is kept frozen. The small in-domain model can be flexibly parameterized to consider available training resources. Direct Finetuning seeks to update the motion module of the large-scale video model with in-domain paired video data.
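To make the score composition concrete, here is a minimal PyTorch sketch of one denoising step under (Inverse) Probabilistic Adaptation. The callables `eps_large` and `eps_small`, the weight `w`, and the additive composition are illustrative assumptions; the paper's exact weighting scheme may differ.

```python
import torch

@torch.no_grad()
def composed_noise_prediction(eps_large, eps_small, x_t, t, text_emb,
                              w=0.5, inverse=False):
    """One denoising step combining two text-conditioned video diffusion models.

    eps_large: frozen internet-pretrained denoiser (hypothetical callable).
    eps_small: lightweight in-domain denoiser trained on paired video data.
    """
    e_large = eps_large(x_t, t, text_emb)
    e_small = eps_small(x_t, t, text_emb)
    # Probabilistic Adaptation: the small in-domain score adapts the frozen
    # large model; the Inverse variant swaps which model plays which role.
    base, adapter = (e_small, e_large) if inverse else (e_large, e_small)
    # Product-of-experts style weighted score composition (illustrative).
    return base + w * adapter
```

Because the large pretrained model stays frozen in both variants, only the small in-domain model needs to be trained, which keeps the data and compute requirements modest.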
Experiments
We evaluate how adapted video models can enable text-conditioned generalization via two approaches: visual planning and policy supervision. For visual planning, the adapted video model synthesizes a text-conditioned video plan into the future, which is then converted into actions to follow. In policy supervision, the adapted video model is used in a discriminative manner to evaluate frames achieved by the policy; these are converted into text-conditioned rewards, which the policy is optimized to maximize. Below we visualize the actual rollouts during environment interaction.
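As a concrete illustration of policy supervision, the following sketch uses an adapted video diffusion model discriminatively: policy-achieved frames are forward-diffused, the model predicts the injected noise under the text condition, and the negative prediction error is taken as the reward. The names (`denoiser`, `q_sample`, `num_timesteps`) and the aggregation over noise draws are assumptions for illustration, not the paper's exact recipe.

```python
import torch

@torch.no_grad()
def text_conditioned_reward(denoiser, q_sample, num_timesteps,
                            frames, text_emb, n_draws=4):
    """Score policy rollout frames with an adapted video diffusion model.

    frames: achieved frames packed as a video tensor,
            shape (batch, channels, time, height, width).
    """
    rewards = []
    for _ in range(n_draws):
        t = torch.randint(0, num_timesteps, (frames.shape[0],),
                          device=frames.device)
        noise = torch.randn_like(frames)
        x_t = q_sample(frames, t, noise)        # forward-diffuse the frames
        eps_pred = denoiser(x_t, t, text_emb)   # text-conditioned denoising
        # Lower denoising error means the frames are more likely under the
        # text-conditioned model, and therefore earn a higher reward.
        rewards.append(-(eps_pred - noise).pow(2).mean(dim=(1, 2, 3, 4)))
    return torch.stack(rewards).mean(dim=0)     # average over noise draws
```

For visual planning, the synthesized video plan would instead be translated into executable actions, for example with an inverse dynamics model (an illustrative choice, not necessarily the paper's).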
Policy Supervision
"a robot arm opening a door"
AnimateDiff (Vanilla)
"a [D] robot arm opening a door"
Subject Customization
"a robot arm opening a door"
Direct Finetuning
"a robot arm opening a door" | "door open"
Probabilistic Adaptation
"a robot arm opening a door" | "door open"
Inverse Probablistic Adaptation
"an action figure walking"
AnimateDiff (Vanilla)
"a [D] action figure walking"
Subject Customization
"an action figure walking"
Direct Finetuning
"an action figure walking"
Probabilistic Adaptation
"a dog walking"
AnimateDiff (Vanilla)
"a [D] dog walking"
Subject Customization
"a dog walking"
Direct Finetuning
"a dog walking"
Probabilistic Adaptation
Visual Planning
"a robot arm pushing a button"
AnimateDiff (Vanilla)
"button press"
In-Domain Only
"a [D] robot arm pushing a button"
Subject Customization
"a robot arm pushing a button" | "button press"
Probabilistic Adaptation
"a robot arm pushing a button" | "button press"
Inverse Probabilistic Adaptation
"a robot arm closing a drawer"
AnimateDiff (Vanilla)
"drawer close"
In-Domain Only
"a [D] robot arm closing a drawer"
Subject Customization
"a robot arm closing a drawer" | "drawer close"
Probabilistic Adaptation
"a robot arm closing a drawer" | "drawer close"
Inverse Probabilistic Adaptation
Visual Planning with Suboptimal Data
"drawer close"
In-Domain Only
"a robot arm closing a drawer" | "drawer close"
Probabilistic Adaptation
"a robot arm closing a drawer" | "drawer close"
Inverse Probabilistic Adaptation
"window close"
In-Domain Only
"a robot arm closing a window" | "window close"
Probabilistic Adaptation
"a robot arm closing a window" | "window close"
Inverse Probabilistic Adaptation
Quantitative Results on MetaWorld
We find that our proposed Inverse Probabilistic Adaptation serves as a strong adaptation technique across different task and evaluation settings, and remains robust even when only suboptimal demonstrations are available.
Novel Text-Conditioned Generalization
We visualize the free-form video generated by adapted video models, conditioned on a novel text prompt ("a dog jumping") that was unseen during adaptation. When the adapted video model is used for policy supervision (simply as a critic that provides text-conditioned rewards), we show that it can successfully supervise a downstream Dog agent to behave according to this novel text specification in a zero-shot manner.
"a dog jumping"
Free-form Generation (Direct Finetuning)
"a dog jumping"
Policy Rollout (Direct Finetuning)
BibTeX
@inproceedings{luo2025solving,
  title={Solving New Tasks by Adapting Internet Video Knowledge},
  author={Calvin Luo and Zilai Zeng and Yilun Du and Chen Sun},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}