Analysis

Comparative Analysis Across Datasets. We compare ZeroVO variants with existing baselines using standard metrics: translation, rotation, absolute trajectory, and scale errors. All methods are provided with estimated camera intrinsics and metric depth. ZeroVO+ is our model trained on additional data with semi-supervision, and LiteZeroVO+ is a smaller variant for resource-constrained settings. Our models demonstrate strong performance across metrics and datasets, particularly in metric translation estimation. As the scale error highlights, GTA and nuScenes contain challenging evaluation settings, including nighttime, weather variations, haze, and reflections. Note that the TartanVO and DPVO baselines (in gray) predict only up-to-scale motion and rely on privileged information at evaluation time, namely ground-truth scale alignment.
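To make the distinction between metric and up-to-scale evaluation concrete, the sketch below shows one common way to align an up-to-scale trajectory to ground truth with a single least-squares scale factor and then measure an absolute trajectory error as an RMSE over positions. This is an illustration only: the function names `align_scale` and `ate_rmse`, the toy trajectories, and the exact error definitions are assumptions for this example, and the benchmark protocol used for the reported numbers may differ.

```python
import numpy as np


def align_scale(pred_xyz, gt_xyz):
    """Return the scalar s that minimizes ||s * pred - gt||^2 (hypothetical helper)."""
    pred = np.asarray(pred_xyz, dtype=float)
    gt = np.asarray(gt_xyz, dtype=float)
    denom = np.sum(pred * pred)
    if denom == 0.0:
        raise ValueError("predicted trajectory has zero extent")
    return float(np.sum(pred * gt) / denom)


def ate_rmse(pred_xyz, gt_xyz):
    """Root-mean-square position error between two (N, 3) trajectories."""
    pred = np.asarray(pred_xyz, dtype=float)
    gt = np.asarray(gt_xyz, dtype=float)
    return float(np.sqrt(np.mean(np.sum((pred - gt) ** 2, axis=1))))


# Toy data (assumed): a straight path, predicted at half the true scale.
gt = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]], dtype=float)
pred = 0.5 * gt

s = align_scale(pred, gt)             # privileged step: uses ground truth
ate_raw = ate_rmse(pred, gt)          # error without any alignment
ate_aligned = ate_rmse(s * pred, gt)  # error after ground-truth scale alignment

print(f"scale={s:.3f} raw={ate_raw:.3f} aligned={ate_aligned:.3f}")
```

In this toy case the aligned error vanishes even though the raw prediction is off by a factor of two, which is why scale-aligned results for up-to-scale methods are not directly comparable to metric predictions.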

Ablation Analysis for Model and Training Components. We ablate the following components: Flow module (F), Depth module (D), Language prior (L), Semi-supervised training (S), and Pseudo-label selection (P). Flow, depth, and language together make up the proposed supervised ZeroVO model. Results with additional semi-supervised training are reported as ZeroVO+, which integrates all of our proposed components and achieves state-of-the-art performance.

Qualitative Results on KITTI. We show trajectory predictions for the four most complex driving sequences (00, 02, 05, and 08) from the KITTI dataset. Each subplot shows the trajectories produced by our proposed model and the baseline models alongside the ground-truth trajectory. Our approach aligns most closely with the ground truth, particularly through challenging turns and along extended straight paths, which indicates that the method is robust to complex and diverse driving scenarios.

Qualitative Examples

GTA Dataset

We introduce a new simulated dataset generated with the high-fidelity GTA simulation. Our GTA dataset consists of 922 driving sequences captured within a simulated city environment, covering diverse weather conditions, driving speeds (particularly high-speed maneuvers not found in other public datasets), traffic scenarios, and times of day. Compared to other commonly used open-source simulation platforms such as CARLA, GTA offers two key advantages: (1) enhanced image realism through ReShade graphics settings that support higher-quality rendering, and (2) a wider variety of road conditions across weather scenarios. For on-road driving, these conditions include significant uphill and downhill gradients, tunnels, and underground parking facilities; for off-road driving, the environment features mountains, deserts, snow-covered terrain, and forests, thereby enabling more precise and complex rotational dynamics throughout the map.

Off-Road Desert

Foggy Forest Trail

Mountain Cliff Path

Urban Intersection (Sunny)

Highway in Rain

Nighttime Highway (Rain)

Acknowledgments

We thank the Red Hat Collaboratory (awards 2024-01-RH02, 2024-01-RH07) and the National Science Foundation (IIS-2152077) for supporting this research.

BibTeX

@inproceedings{lai2025zerovo,
  title={ZeroVO: Visual Odometry with Minimal Assumptions},
  author={Lai, Lei and Yin, Zekai and Ohn-Bar, Eshed},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={17092--17102},
  year={2025}
}

Page template borrowed from CaT