Learning to Act from Actionless Videos through Dense Correspondences


Framework Overview
- (a) Our model takes an RGB-D observation of the current environment state and a textual goal description as input.
- (b) It first synthesizes a video of the imagined task execution using a diffusion model.
- (c) Next, it estimates the optical flow between adjacent frames of the synthesized video.
- (d) Finally, it uses the optical flow as dense correspondences between frames, together with the depth of the first frame, to compute SE(3) transformations of the target object and, subsequently, robot arm commands (a sketch of this step follows below).
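One way to carry out step (d) given only the first frame's depth is to treat it as a Perspective-n-Point (PnP) problem: lift the object pixels of the first frame to 3D using the depth map, follow the optical flow to their 2D locations in the next frame, and solve for the rigid motion. The sketch below illustrates this idea and is not the released solver; the object mask `mask`, the intrinsics `K`, and the helper name `object_se3_from_flow` are assumptions for illustration.

```python
# A minimal sketch (not the released code) of step (d): back-project the
# target-object pixels of the first frame to 3D with its depth map, follow the
# optical flow to their 2D locations in the next frame, and recover the object's
# SE(3) motion by solving a PnP problem with RANSAC.
import cv2
import numpy as np

def object_se3_from_flow(depth0, flow, K, mask):
    """Estimate the rigid transform of the masked object between two frames.

    depth0: (H, W) depth of the first frame
    flow:   (H, W, 2) optical flow from the first frame to the next frame
    K:      (3, 3) camera intrinsics
    mask:   (H, W) boolean segmentation of the target object in the first frame
    """
    v, u = np.nonzero(mask)

    # 3D points of the object in the first frame (camera coordinates).
    z = depth0[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts3d = np.stack([x, y, z], axis=-1).astype(np.float64)

    # Their 2D correspondences in the next frame, given by the flow.
    pts2d = np.stack([u + flow[v, u, 0], v + flow[v, u, 1]], axis=-1).astype(np.float64)

    # PnP with RANSAC recovers the rigid motion that reprojects the 3D points
    # onto their flowed 2D locations, i.e., the object's SE(3) transform.
    ok, rvec, tvec, _ = cv2.solvePnPRansac(pts3d, pts2d, K.astype(np.float64), None)
    assert ok, "PnP failed; too few or degenerate correspondences"
    R, _ = cv2.Rodrigues(rvec)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T  # SE(3) transform of the object between the two frames
```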
Real-World Franka Emika Panda Arm with Bridge Dataset
We train our video generation model on the Bridge dataset (Ebert et al., 2022) and evaluate on a real-world Franka Emika Panda tabletop manipulation environment.
Synthesized Videos
Robot Executions
Meta-World
We train our video generation model on 165 videos spanning 11 tasks and evaluate on robot manipulation tasks from the Meta-World simulated benchmark (Yu et al., 2019).
Synthesized Videos
Robot Executions
iTHOR
We train our video generation model on 240 videos covering 12 target objects and evaluate on object navigation tasks from the iTHOR simulated benchmark (Kolve et al., 2017).
Synthesized Videos
Robot Navigation
Cross-Embodiment Learning (Visual Pusher)
We train our video generation model on ~200 actionless human pushing videos and evaluate in the Visual Pusher robot environment (Schmeckpeper et al., 2021; Zakka et al., 2022).
Failed Executions
Successful Executions
Zero-Shot Generalization to Real-World Scenes with the Bridge Model
We show that our video diffusion model trained on the Bridge dataset (mostly toy kitchens) can already generalize to complex real-world kitchen scenes. Note that the videos appear blurry because the original video resolution is low (48x64).
Extended Analysis and Ablation Studies
Comparison of First-Frame Conditioning Strategy
We compare our proposed first-frame conditioning strategy (cat_c) with the naive frame-wise concatenation strategy (cat_t). Our method (cat_c) consistently outperforms the frame-wise concatenation baseline (cat_t) when trained on the Bridge dataset. Below we provide qualitative examples of videos synthesized after 40k training steps; a short sketch of the two conditioning strategies follows.
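For reference, the snippet below sketches the two conditioning strategies as tensor operations; the variable names and shapes are illustrative and not taken from the model's actual code.

```python
# A minimal sketch of the two first-frame conditioning strategies compared here.
# `x_noisy` is the noisy video the diffusion model denoises and `frame0` is the
# observed (clean) first frame; shapes are illustrative only.
import torch

B, T, C, H, W = 2, 8, 3, 48, 64
x_noisy = torch.randn(B, T, C, H, W)   # noisy video frames
frame0 = torch.randn(B, C, H, W)       # clean first-frame observation

# (cat_c) channel-wise conditioning: tile the first frame over time and
# concatenate it with every noisy frame along the channel dimension.
cond_c = frame0.unsqueeze(1).expand(-1, T, -1, -1, -1)
inp_cat_c = torch.cat([x_noisy, cond_c], dim=2)          # (B, T, 2C, H, W)

# (cat_t) frame-wise conditioning: prepend the clean first frame as an extra
# frame along the temporal dimension.
inp_cat_t = torch.cat([frame0.unsqueeze(1), x_noisy], dim=1)  # (B, T+1, C, H, W)
```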
Improving Inference Efficiency with Denoising Diffusion Implicit Models
This section investigates accelerating the sampling process with Denoising Diffusion Implicit Models (DDIM; Song et al., 2021). Instead of iteratively denoising for 100 steps as reported in the main paper, we experiment with smaller numbers of denoising steps (25, 10, 5, and 3) using DDIM. We find that DDIM can generate high-fidelity videos with only 1/10 of the sampling steps (10 steps), making the approach viable for time-critical tasks. We present videos synthesized with 25, 10, 5, and 3 denoising steps below.
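For reference, the snippet below sketches the deterministic DDIM update (eta = 0) with a reduced number of denoising steps; `model` and `alphas_cumprod` stand in for the trained noise predictor and its training noise schedule, the conditioning inputs (first frame, text) are omitted for brevity, and this illustrates the DDIM rule rather than the exact sampler used in our experiments.

```python
# A minimal sketch of deterministic DDIM sampling (eta = 0) with few steps.
# `model(x, t)` is assumed to predict the added noise; `alphas_cumprod` holds
# the cumulative alpha schedule from training.
import torch

@torch.no_grad()
def ddim_sample(model, shape, alphas_cumprod, num_steps=10):
    T = alphas_cumprod.shape[0]
    # Evenly spaced subsequence of the T timesteps used at training time.
    timesteps = torch.linspace(T - 1, 0, num_steps).long().tolist()

    x = torch.randn(shape, device=alphas_cumprod.device)  # start from pure noise
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = (alphas_cumprod[timesteps[i + 1]] if i + 1 < len(timesteps)
                  else alphas_cumprod.new_tensor(1.0))

        eps = model(x, t)                                    # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # predicted clean video
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps   # deterministic DDIM step
    return x
```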