Attention Maps
We condition the VOT on the noiseless action taken by the agent. Inspecting the attention maps, we find that different actions prime the VOT to attend to meaningful regions in the image. For instance, turning left leads the model to focus on regions present at both time steps (see below). This makes intuitive sense, as a turning action of 30° strongly displaces visual features or even pushes them out of the agent’s field of view. A similar behavior emerges for moving forward, which leads the model to attend to the center regions, e.g., the walls and the end of a hallway (see below).
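For readers who want to reproduce this kind of visualization offline, the following is a minimal sketch of overlaying a patch-level attention map on an observation. It is not the code used for this page: the function name `overlay_attention`, the red-channel heatmap, and the nearest-neighbour upsampling are all assumptions made for illustration, and the image size is assumed to be an integer multiple of the attention grid.

```python
import numpy as np

def overlay_attention(image, attn, alpha=0.5):
    """Blend a coarse attention map over an RGB image (hypothetical helper).

    image: (H, W, 3) float array with values in [0, 1]
    attn:  (h, w) non-negative attention weights, one per image patch
    alpha: maximum opacity of the overlay
    """
    image = np.asarray(image, dtype=float)
    attn = np.asarray(attn, dtype=float)
    H, W, _ = image.shape
    h, w = attn.shape
    if H % h or W % w:
        raise ValueError("image size must be a multiple of the attention grid")

    # Min-max normalise the attention weights to [0, 1].
    span = attn.max() - attn.min()
    norm = (attn - attn.min()) / span if span > 0 else np.zeros_like(attn)

    # Nearest-neighbour upsampling: repeat each patch weight over its pixels.
    up = np.kron(norm, np.ones((H // h, W // w)))

    # Red heatmap, blended more strongly where attention is high.
    heat = np.zeros_like(image)
    heat[..., 0] = up
    weight = alpha * up[..., None]
    return (1.0 - weight) * image + weight * heat
```

Calling `overlay_attention(rgb, attn, alpha=0.6)` on a normalised RGB frame returns an array of the same shape, which can be displayed with any image viewer.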
Habitat Challenge
We submit our VOT (RGB-D) to the Habitat Challenge 2021 benchmark (test-std split). Using the same navigation policy as Rank 2, we achieve the highest SSPL while training on only 5% of the data. (Leaderboard)
| Rank | Participant team | S | SPL | SSPL |
|---|---|---|---|---|
| 1 | MultiModalVO (VOT) (ours) | 93 | 74 | 77 |
| 2 | VO for Realistic PointGoal | 94 | 74 | 76 |
| 3 | inspir.ai robotics | 91 | 70 | 71 |
| 4 | VO2021 | 78 | 59 | 69 |
| 5 | Differentiable SLAM-net | 65 | 47 | 60 |
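To make the columns concrete, here is a small sketch of how the per-episode metrics are commonly defined for PointGoal navigation, assuming S denotes success, SPL is success weighted by path length, and SSPL is the soft variant that replaces binary success with the fraction of the initial distance covered. The function and argument names are illustrative and not taken from the challenge code; the leaderboard values are averages over episodes, reported as percentages.

```python
def spl(success, shortest_dist, path_length):
    """Success weighted by Path Length for a single episode.

    success:       True if the agent stopped within the goal radius
    shortest_dist: geodesic distance from start to goal
    path_length:   length of the path the agent actually travelled
    """
    return float(success) * shortest_dist / max(shortest_dist, path_length)

def soft_spl(final_dist, shortest_dist, path_length):
    """Soft SPL: binary success is replaced by progress toward the goal.

    final_dist: geodesic distance to the goal when the episode ends
    """
    soft_success = max(0.0, 1.0 - final_dist / shortest_dist)
    return soft_success * shortest_dist / max(shortest_dist, path_length)

# An episode that stops 1 m short of a goal 10 m away after walking 12.5 m:
# it counts as a failure for SPL but still earns partial credit under SSPL.
failed_spl = spl(False, 10.0, 12.5)
partial_sspl = soft_spl(1.0, 10.0, 12.5)
```

Because SSPL gives partial credit for episodes that end near the goal, it is less sensitive to the stopping decision than S and SPL.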
BibTeX
@inproceedings{memmel2023modality,
title={Modality-invariant Visual Odometry for Embodied Vision},
author={Memmel, Marius and Bachmann, Roman and Zamir, Amir},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={21549--21559},
year={2023}
}