VideoAgent: Self-Improving Video Generation
*Equal contribution
Abstract
Video generation has been used to generate visual plans for controlling robotic systems. Given an image observation and a language instruction, previous work has generated video plans which are then converted to robot controls to be executed. However, a major bottleneck in leveraging video generation for control lies in the quality of the generated videos, which often suffer from hallucinatory content and unrealistic physics, resulting in low task success when control actions are extracted from the generated videos. While scaling up dataset and model size provides a partial solution, integrating external feedback is both natural and essential for grounding video generation in the real world. With this observation, we propose VideoAgent for self-improving generated video plans based on external feedback. Instead of directly executing the generated video plan, VideoAgent first refines it using a novel procedure which we call self-conditioning consistency, utilizing feedback from a pretrained vision-language model (VLM). As the refined video plan is being executed, VideoAgent collects additional data from the environment to further improve video plan generation. Experiments in simulated robotic manipulation from Meta-World and iTHOR show that VideoAgent drastically reduces hallucination, thereby boosting the success rate of downstream manipulation tasks. We further illustrate that VideoAgent can effectively refine real-robot videos, providing an early indicator that robotics can be an effective tool in grounding video generation in the physical world.
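The refine-before-execute idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `refine_step`, `vlm_accepts`, and `max_iters` are hypothetical stand-ins for the video refinement model, the VLM feedback check, and an iteration budget, and the "video plan" here is a toy value.

```python
from typing import Callable, TypeVar

Plan = TypeVar("Plan")


def refine_video_plan(
    initial_plan: Plan,
    refine_step: Callable[[Plan], Plan],
    vlm_accepts: Callable[[Plan], bool],
    max_iters: int = 3,
) -> Plan:
    """Refine a plan until the feedback check accepts it or the budget runs out.

    All callables are hypothetical placeholders (assumptions, not the paper's API):
    - refine_step: produces an improved plan from the current one
    - vlm_accepts: returns True when the feedback model judges the plan acceptable
    """
    plan = initial_plan
    for _ in range(max_iters):
        if vlm_accepts(plan):
            break  # feedback is positive, so stop refining
        plan = refine_step(plan)
    return plan


# Toy usage: a "plan" is just a count of hallucinated frames (assumption).
toy_refine = lambda errors: max(errors - 1, 0)  # each pass removes one error
toy_accept = lambda errors: errors == 0         # accept only error-free plans

fully_refined = refine_video_plan(2, toy_refine, toy_accept, max_iters=5)
budget_limited = refine_video_plan(5, toy_refine, toy_accept, max_iters=3)
```

In the toy run, the first call stops early once the check passes, while the second exhausts its budget and returns a partially refined plan, which is then what would be executed.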
Quantitative Results
Meta-World Results
Meta-World Results. Mean success rates of the baselines and VideoAgent variants on 11 simulated robot manipulation environments from Meta-World. VideoAgent-Online-Replan achieves the highest success rate on every task, tying AVDC-Replan only on Hammer.
| Task | AVDC | AVDC-Replan | VideoAgent | VideoAgent-Online (Iter1) | VideoAgent-Online (Iter2) | VideoAgent-Online-Replan |
|---|---|---|---|---|---|---|
| Door Open | 30.7% | 72.0% | 40.0% | 41.3% | 44.0% | 80.0% |
| Door Close | 28.0% | 89.3% | 29.3% | 32.0% | 29.3% | 97.3% |
| Basketball | 21.3% | 37.3% | 13.3% | 17.3% | 18.7% | 40.0% |
| Shelf Place | 8.0% | 18.7% | 9.3% | 12.0% | 18.7% | 22.7% |
| Button Press | 34.7% | 60.0% | 38.7% | 45.3% | 46.7% | 72.0% |
| Button Press Topdown | 17.3% | 24.0% | 18.7% | 14.7% | 16.0% | 40.0% |
| Faucet Close | 12.0% | 53.3% | 46.7% | 38.7% | 49.3% | 58.7% |
| Faucet Open | 17.3% | 24.0% | 12.0% | 13.3% | 21.3% | 36.0% |
| Handle Press | 41.3% | 81.3% | 36.0% | 36.0% | 44.0% | 85.3% |
| Hammer | 0.0% | 8.0% | 0.0% | 0.0% | 1.3% | 8.0% |
| Assembly | 5.3% | 6.7% | 1.3% | 4.0% | 1.3% | 10.7% |
| Overall | 19.6% | 43.1% | 22.3% | 23.2% | 26.4% | 50.0% |
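As a quick sanity check, the Overall row can be reproduced from the per-task rates. The sketch below assumes Overall is the unweighted mean over the 11 tasks (the page does not state the aggregation); small differences are expected because the per-task entries are already rounded.

```python
from statistics import mean

# Per-task success rates (%) copied from the table above, in row order.
rates = {
    "AVDC": [30.7, 28.0, 21.3, 8.0, 34.7, 17.3, 12.0, 17.3, 41.3, 0.0, 5.3],
    "AVDC-Replan": [72.0, 89.3, 37.3, 18.7, 60.0, 24.0, 53.3, 24.0, 81.3, 8.0, 6.7],
    "VideoAgent": [40.0, 29.3, 13.3, 9.3, 38.7, 18.7, 46.7, 12.0, 36.0, 0.0, 1.3],
    "VideoAgent-Online (Iter1)": [41.3, 32.0, 17.3, 12.0, 45.3, 14.7, 38.7, 13.3, 36.0, 0.0, 4.0],
    "VideoAgent-Online (Iter2)": [44.0, 29.3, 18.7, 18.7, 46.7, 16.0, 49.3, 21.3, 44.0, 1.3, 1.3],
    "VideoAgent-Online-Replan": [80.0, 97.3, 40.0, 22.7, 72.0, 40.0, 58.7, 36.0, 85.3, 8.0, 10.7],
}

# Overall values as reported in the last row of the table.
reported = {
    "AVDC": 19.6,
    "AVDC-Replan": 43.1,
    "VideoAgent": 22.3,
    "VideoAgent-Online (Iter1)": 23.2,
    "VideoAgent-Online (Iter2)": 26.4,
    "VideoAgent-Online-Replan": 50.0,
}

# Unweighted mean per method (assumed aggregation).
overall = {name: mean(values) for name, values in rates.items()}

for name, value in overall.items():
    print(f"{name}: computed {value:.2f}%, reported {reported[name]:.1f}%")
```

Under this assumption, every computed mean lands within 0.1 percentage points of the reported value.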
iTHOR Success Rates
| Room | AVDC Baseline | VideoAgent (Ours) |
|---|---|---|
| Kitchen | 26.7% | 28.3% |
| Living Room | 23.3% | 26.7% |
| Bedroom | 38.3% | 41.7% |
| Bathroom | 36.7% | 40.0% |
| Overall | 31.3% | 34.2% |
BridgeData-V2 Results
| Metric | AVDC | VideoAgent (Ours) |
|---|---|---|
| CLIP Score | 22.39 | 22.90 |
| Flow Consistency | 2.48 ± 0.00 | 2.59 ± 0.01 |
| Visual Quality | 1.97 ± 0.003 | 2.01 ± 0.003 |
| Temporal Consistency | 1.48 ± 0.01 | 1.55 ± 0.01 |
| Dynamic Degree | 3.08 ± 0.01 | 3.07 ± 0.02 |
| Text to Video Alignment | 2.26 ± 0.003 | 2.30 ± 0.03 |
| Factual Consistency | 2.02 ± 0.004 | 2.07 ± 0.01 |
| Average Video Score | 2.16 ± 0.01 | 2.20 ± 0.01 |
| Human Eval on Task Success | 42.0% | 64.0% |
Qualitative Results
Meta-World Qualitative Results
Synthesized Videos
Each of the nine examples compares four synthesized videos: Base video, VideoAgent, VideoAgent-online, and VideoAgent-suggestive.
Environment 1: Door Open
Robot Executions
Task: door-close
Task: door-open
Task: button-press-topdown
Task: hammer
iTHOR Qualitative Results
Robot Executions
Bridge Qualitative Results
Synthesized Videos from AVDC
Synthesized Videos with Refinement using VideoAgent
Task Description: Put Banana in Colander
Effect of Different Feedback
Effect of Refinement Iterations
Effect of Online Iterations