Object Insertion
We demonstrate GEM's ability to insert objects into a scene and precisely control their motion. In the following examples, we insert a new car into the scene and also control the motion of cars already present.
Unconditional Generation
Insertion Control
Human Pose Control
GEM can use human poses to control pedestrian motion within the scene. In these examples, pedestrians either cross the street or stop according to the provided controls.
Moving-pose control
Static-pose control
Long Generation
We compare our long generations with those of Vista, the only world model trained on OpenDV that is capable of generating long sequences. We observe that our generations have temporally more consistent ego motion and more realistic dynamics.
GEM's Long Generation
Vista's Long Generation
Interesting Observations
We show interesting behaviors observed in the generated videos. These behaviors do not necessarily exist in the ground truth videos, but emerge from the model's learned dynamics.
Brake lights go off before moving
Smooth takeover dynamics on a long generation
Multimodal
GEM generates two modalities simultaneously: RGB and Depth. We show examples of multimodal generations.
Multidomain
We fine-tune GEM on two other egocentric domains and observe that it adapts quickly to both.
1. Drone Flights
2. Human Egocentric
Pseudo-labeling
Below, we present visualizations demonstrating our pseudo-labeling pipeline’s capability to generate skeleton poses, depth maps, and ego-motion trajectories.