Pix2Map: Cross-modal Retrieval for Inferring Street Maps from Images
Xindi Wu 1*, KwunFung Lau 1, Francesco Ferroni 2, Aljoša Ošep 1, Deva Ramanan 1,2
🌐 CVPR 2023 | 📄 Paper | 🖼️ Poster | 📽 Video
Abstract
Self-driving vehicles rely on urban street maps for autonomous navigation. In this paper, we introduce Pix2Map, a method for inferring urban street map topology directly from ego-view images, as needed to continually update and expand existing maps. This is a challenging task, as we need to infer a complex urban road topology directly from raw image data. The main insight of this paper is that this problem can be posed as cross-modal retrieval by learning a joint, cross-modal embedding space for images and existing maps, represented as discrete graphs that encode the topological layout of the visual surroundings. We conduct our experimental evaluation using the Argoverse dataset and show that it is indeed possible to accurately retrieve street maps corresponding to both seen and unseen roads solely from image data. Moreover, we show that our retrieved maps can be used to update or expand existing maps and even show proof-of-concept results for visual localization and image retrieval from spatial graphs.
Task overview
Problem: Infer topological road maps from images.
Challenges: Learning to map continuous images to discrete graphs (maps) in bird's-eye view (BEV), with varying numbers of nodes and varying topology, is difficult.
Prior works: (jointly) learn a non-linear mapping from image pixels to BEV and estimate the road layout by generating a discrete spatial graph from detected lane markings.
Our approach:
Pix2Map returns the graph whose embedding is most similar to that of the input image via cross-modal retrieval; a minimal sketch of this retrieval step follows.
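The sketch below illustrates the retrieval step, assuming image and graph encoders that produce fixed-size embeddings compared by cosine similarity. The names `retrieve_graph` and `graph_bank` are hypothetical illustration choices, not identifiers from the released code.

```python
import torch
import torch.nn.functional as F

def retrieve_graph(phi_image: torch.Tensor, graph_bank: torch.Tensor) -> int:
    """Return the index of the stored graph whose embedding has the
    highest cosine similarity to the query image embedding.

    phi_image:  (D,)   embedding of the ego-view image stack.
    graph_bank: (K, D) embeddings of K candidate street-map graphs.
    """
    phi_image = F.normalize(phi_image, dim=-1)
    graph_bank = F.normalize(graph_bank, dim=-1)
    return torch.argmax(graph_bank @ phi_image).item()
```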
Pix2Map: The graph encoder (bottom) computes a graph embedding vector \( \phi_{\text{graph}} \) for each street map in a batch. The image encoder (top) outputs an image embedding \( \phi_{\text{image}} \) for each corresponding image stack. We then build a similarity matrix for each batch that contrasts the image and graph embeddings. We highlight that the adjacency matrix of a given graph is used as the attention mask for our transformer-based graph encoder.
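A batch-level similarity matrix like the one in the figure is commonly trained with a symmetric contrastive (CLIP-style) objective, and an adjacency matrix can be turned into an additive attention mask. The sketch below is an assumed PyTorch rendering of these two ideas, not the authors' released code; the temperature value is a common default rather than a value from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(phi_image, phi_graph, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.
    phi_image, phi_graph: (B, D); matching pairs share a batch index."""
    phi_image = F.normalize(phi_image, dim=-1)
    phi_graph = F.normalize(phi_graph, dim=-1)
    # (B, B) similarity matrix: entry (i, j) compares image i with graph j.
    logits = phi_image @ phi_graph.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Correct pairs lie on the diagonal; contrast along rows and columns.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def adjacency_attention_mask(adj: torch.Tensor) -> torch.Tensor:
    """Turn an (N, N) 0/1 adjacency matrix into an additive attention
    mask: node pairs without an edge get -inf, so each transformer
    layer attends only along graph edges (plus self-attention)."""
    mask = torch.zeros_like(adj, dtype=torch.float)
    mask[adj == 0] = float('-inf')
    mask.fill_diagonal_(0.0)  # always let a node attend to itself
    return mask
```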
Results
Baseline comparisons. For a fair comparison with the prior art [1], in this experiment we (i) train Pix2Map using frontal \(50m \times 50m\) road graphs (as opposed to our default setting of predicting the surrounding \(40m \times 40m\) area). Moreover, we (ii) train Pix2Map with a single frontal view (Pix2Map-Single) to ensure consistent comparisons with baselines. Importantly, even in this setting, our method still outperforms baselines by a large margin: a Chamfer distance of 2.6819, compared to 3.0140 for the closest competitor, TOPO-TR [1].
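For reference, the Chamfer distance between two point sets can be computed as below. This is a generic sketch of the metric applied to BEV node coordinates; the paper's exact evaluation protocol (e.g., how points are sampled along lane segments) may differ.

```python
import torch

def chamfer_distance(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between two 2-D point sets, e.g.
    predicted vs. ground-truth lane-graph node coordinates in BEV.
    pred: (N, 2), gt: (M, 2), both in meters."""
    d = torch.cdist(pred, gt)  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```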
Video presentation
Xindi Wu, KwunFung Lau, Francesco Ferroni, Aljoša Ošep, Deva Ramanan
Pix2Map: Cross-modal Retrieval for Inferring Street Maps from Images
CVPR 2023
📄 Paper
📽 Video
🖼️ Poster