Ctrl-World: A Controllable Generative World Model for Robot Manipulation
Ctrl-World is designed for policy-in-the-loop rollouts with generalist robot policies. It generates joint multi-view predictions (including wrist views), enforces fine-grained action control via frame-level conditioning, and sustains coherent long-horizon dynamics through pose-conditioned memory retrieval. Together, these components enable (1) accurate evaluation of policy instruction-following ability via imagination, and (2) targeted policy improvement on previously unseen instructions.
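As a rough illustration of what pose-conditioned memory retrieval could look like, the sketch below picks the past frames whose recorded robot pose lies closest to the current pose. It is not the Ctrl-World implementation: the 3-D pose representation, the Euclidean distance metric, and the names `retrieve_memory` and `history` are all assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical pose representation: end-effector position (x, y, z) in meters.
Pose = Sequence[float]


def retrieve_memory(
    history: List[Tuple[Pose, int]],
    query_pose: Pose,
    k: int = 2,
) -> List[int]:
    """Return the frame indices of the k past entries whose pose is closest to query_pose."""
    # Rank stored (pose, frame_index) pairs by Euclidean distance to the query pose.
    ranked = sorted(history, key=lambda item: math.dist(item[0], query_pose))
    return [frame_idx for _, frame_idx in ranked[:k]]


# Toy history of (pose, frame index) pairs.
history = [
    ((0.00, 0.00, 0.10), 0),
    ((0.10, 0.00, 0.10), 1),
    ((0.20, 0.05, 0.10), 2),
    ((0.02, 0.01, 0.10), 3),
]
nearest = retrieve_memory(history, query_pose=(0.01, 0.00, 0.10), k=2)
print(nearest)
```

The intuition is that frames seen at a similar pose are likely to show similar scene content, so they can serve as context when the rollout revisits a region after many steps.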
Interactive demos
Starting from the same initial frame, Ctrl-World can autoregressively generate diverse future trajectories conditioned on the given action chunks, achieving centimeter-level precision. You can select any combination of actions and generate the corresponding video. Every video is generated from the initial frame and a different sequence of actions. For interpretability, we translate each action chunk into a text description of the action.
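The rollout loop described above can be sketched with a toy stand-in for the learned model. Everything here is an assumption made for illustration: states are 3-D gripper positions rather than video frames, `toy_world_model` simply integrates per-frame deltas, and the axis naming in `describe_chunk` is a hypothetical convention, not the one used by Ctrl-World.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Chunk = List[Vec3]  # one action chunk: a short sequence of per-frame deltas (meters)


def toy_world_model(last_state: Vec3, chunk: Chunk) -> List[Vec3]:
    """Stand-in for a learned model: one predicted state per action in the chunk."""
    states = []
    x, y, z = last_state
    for dx, dy, dz in chunk:
        x, y, z = x + dx, y + dy, z + dz
        states.append((x, y, z))
    return states


def rollout(initial_state: Vec3, chunks: List[Chunk]) -> List[Vec3]:
    """Autoregressive rollout: each chunk is conditioned on the last predicted state."""
    trajectory = [initial_state]
    for chunk in chunks:
        trajectory.extend(toy_world_model(trajectory[-1], chunk))
    return trajectory


def describe_chunk(chunk: Chunk) -> str:
    """Summarize a chunk by its dominant net displacement, in centimeters."""
    net = [sum(step[i] for step in chunk) for i in range(3)]
    axis = max(range(3), key=lambda i: abs(net[i]))
    # Assumed axis convention: +x right, +y forward, +z up.
    names = [("right", "left"), ("forward", "backward"), ("up", "down")]
    direction = names[axis][0] if net[axis] >= 0 else names[axis][1]
    return f"move {direction} {abs(net[axis]) * 100:.1f} cm"


chunks = [
    [(0.02, 0.0, 0.0)] * 3,   # three steps of +2 cm along x
    [(0.0, 0.0, -0.01)] * 4,  # four steps of -1 cm along z
]
trajectory = rollout((0.0, 0.0, 0.10), chunks)
descriptions = [describe_chunk(c) for c in chunks]
print(descriptions)
print(len(trajectory))
```

Swapping in a different list of chunks while keeping the same initial state yields a different trajectory, which mirrors how the demos below vary only the action sequence.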
Interactive Control Demo 1: Keyboard Control
Action chunk 1:
Action chunk 2:
Action chunk 3:
Generated video:
Interactive Control Demo 2: Interact with Different Objects
Action chunks 1-2:
Action chunk 3:
Action chunk 4:
Generated video:
Interactive Control Demo 3: Centimeter-Level Precision
Action chunk 1:
Action chunk 2:
Action chunk 3:
Generated video:
Interactive Control Demo 4: Interact with Different Objects
Action chunks 1-3:
Action chunks 4-5:
Action chunks 6-7:
Generated video:
Comparison of Real-World Executions and World-Model Rollouts (Figure 6 of the paper)
Execution
Pick blue block and place on white plate
Fold the towel into half
Rollout
Execution
Place sponge in drawer
Close the laptop
Rollout
Execution
Move towel from left to right.
Pull one tissue out of the box.
Rollout
Synthetic Data Used for Finetuning the Policy (Figure 8 of the paper)
(e.g., left, right, top right, bottom side)
(e.g., smaller, larger block)