Abstract
Driving world models have gained increasing attention due to their ability to model complex physical dynamics. However, their superb modeling capability is yet to be fully unleashed due to the limited video diversity in current driving datasets. We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics. Our dataset features video clips with a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge, laying a stepping stone for future world model development. We further define an action instruction following (AIF) benchmark for world models and demonstrate the superiority of the proposed dataset for generating action-controlled future predictions.
DrivingDojo Composition
| Dataset | Videos | Type | Camera | Ego Trajectory | Text Description |
|---|---|---|---|---|---|
| DrivingDojo | 18.2k | total | ✓ | ✓ | ✓ |
| DrivingDojo-Action | 7.9k | rich ego-actions | ✓ | ✓ | |
| DrivingDojo-Interplay | 6.4k | multi-agent interplay | ✓ | ✓ | |
| DrivingDojo-Open | 3.9k | open-world knowledge | ✓ | ✓ | ✓ |
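The table can also be read as a small data structure. The sketch below is only an illustration: the `Subset` class and its field names are hypothetical and not part of any released DrivingDojo tooling, and the video counts are the rounded figures from the table (for example, 7.9k is stored as 7900).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subset:
    """Hypothetical record mirroring one subset row of the table above."""
    name: str
    videos: int  # rounded count from the table, e.g. 7.9k -> 7900
    focus: str
    has_camera: bool
    has_ego_trajectory: bool
    has_text: bool


# The three subsets listed in the table (the "total" row is derived below).
SUBSETS = [
    Subset("DrivingDojo-Action", 7900, "rich ego-actions", True, True, False),
    Subset("DrivingDojo-Interplay", 6400, "multi-agent interplay", True, True, False),
    Subset("DrivingDojo-Open", 3900, "open-world knowledge", True, True, True),
]


def total_videos(subsets):
    """Sum the rounded video counts over the given subsets."""
    return sum(s.videos for s in subsets)


def with_text(subsets):
    """Names of subsets that carry text descriptions."""
    return [s.name for s in subsets if s.has_text]
```

Under these rounded counts, `total_videos(SUBSETS)` reproduces the 18.2k figure in the total row, and `with_text(SUBSETS)` shows that the text descriptions come from DrivingDojo-Open.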
DrivingDojo-Action
To enable the world model to generate an unlimited variety of high-fidelity, action-controllable virtual driving environments, we create a subset called DrivingDojo-Action that features a balanced distribution of driving maneuvers. This subset covers a diverse range of longitudinal maneuvers, such as acceleration, deceleration, emergency braking, and stop-and-go driving, and of lateral maneuvers, including lane-changing and lane-keeping.
Turn Left
Go Straight
Turn Right
Lane Change
Emergency Braking
DrivingDojo-Interplay
We design the DrivingDojo-Interplay subset to focus on interactions with dynamic agents, a core component of the dataset. Each clip in this subset includes at least one of the following driving scenarios: cutting in/off, meeting, being blocked, overtaking, and being overtaken.
DrivingDojo-Open
We place particular emphasis on video clips that are rich in open-world knowledge and collect them in the DrivingDojo-Open subset. Describing open-world driving knowledge is challenging due to its complexity and variability, but these scenarios are crucial for ensuring safe driving.
Action Instruction Following (AIF)
We propose action instruction following (AIF) errors, which measure the consistency between the generated video and the input action conditions.
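One way to picture such an error is sketched below. This is a minimal illustration under stated assumptions, not the benchmark's actual definition: it assumes the action condition is a sequence of per-step 2D ego displacements, that an ego trajectory has already been recovered from the generated video by some external pose-estimation method (not shown), and that the error is the mean absolute difference per axis. The function names `displacements` and `aif_errors` are hypothetical.

```python
def displacements(trajectory):
    """Per-step (dx, dy) offsets between consecutive 2D ego positions."""
    return [
        (x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:])
    ]


def aif_errors(commanded, estimated):
    """Mean absolute error per axis between two displacement sequences.

    commanded: displacements given to the model as the action condition.
    estimated: displacements recovered from the generated video.
    Returns (error_x, error_y). The metric form is an assumption of this sketch.
    """
    if len(commanded) != len(estimated):
        raise ValueError("sequences must have the same length")
    if not commanded:
        raise ValueError("sequences must not be empty")
    n = len(commanded)
    err_x = sum(abs(c[0] - e[0]) for c, e in zip(commanded, estimated)) / n
    err_y = sum(abs(c[1] - e[1]) for c, e in zip(commanded, estimated)) / n
    return err_x, err_y


# Toy example: the commanded actions versus a trajectory that a pose
# estimator might return for the generated clip (values are made up).
commanded_actions = [(0.0, 1.0), (0.0, 1.0), (0.5, 1.0)]
estimated_positions = [(0.0, 0.0), (0.1, 0.9), (0.1, 2.0), (0.5, 3.0)]
error_x, error_y = aif_errors(commanded_actions, displacements(estimated_positions))
```

A video that follows its action condition exactly yields zero error on both axes, and larger values indicate that the generated motion drifts from the instruction.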
Generation Demos
We show generation demos from a model trained on the DrivingDojo dataset. The model can generate high-resolution, complex driving scenarios.
Diverse-action Generation
Crossing
Lane Changing
Diverse-scene Generation
Multi-agent Interplay
Out-of-domain Generation
Open-world Generation
Other Modality Condition
Acknowledgements
This work was supported in part by the National Key R&D Program of China (No. 2022ZD0116500), the National Natural Science Foundation of China (No. U21B2042, No. 62320106010), and in part by the 2035 Innovation Program of CAS, and the InnoHK program.