System Fundamentals

Here we describe V-IRL's hierarchical architecture that transforms real cities around the world into a vast virtual playground in which agents can be constructed to solve practical tasks. The platform lies at the foundation—providing the underlying components and infrastructure for agents to employ. Higher level capabilities of Perception, Reasoning, Action, and Collaboration emerge from the platform's components. Finally, agents leverage these capabilities along with user-defined metadata in task-specific run() routines to solve tasks.

Architecture figure
Hierarchical V-IRL architecture.
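
As a rough illustration of the layering described above, the sketch below shows how an agent might combine platform-backed capabilities with user-defined metadata in a task-specific run() routine. All class, method, and field names here (Platform, Agent, perceive, reason, act) are hypothetical and are not taken from the V-IRL codebase; the toy data stands in for real geospatial queries.

```python
from dataclasses import dataclass, field


class Platform:
    """Hypothetical stand-in for the platform layer (imagery, places, movement)."""

    def __init__(self, places_by_location):
        # Maps a location name to the place names visible from it (toy data).
        self.places_by_location = places_by_location

    def nearby_places(self, location):
        return self.places_by_location.get(location, [])


@dataclass
class Agent:
    """Hypothetical agent: capabilities built on the platform plus metadata."""

    platform: Platform
    metadata: dict  # user-defined, e.g. name and intention
    location: str = "start"
    log: list = field(default_factory=list)

    def perceive(self):
        # Perception: query the platform for what is around the agent.
        return self.platform.nearby_places(self.location)

    def reason(self, observations):
        # Reasoning: pick the first observation matching the agent's intention.
        wanted = self.metadata["intention"]
        return next((p for p in observations if wanted in p), None)

    def act(self, target):
        # Action: record the decision; a real agent would move or interact.
        self.log.append(target)
        return target

    def run(self):
        # Task-specific routine that chains the capabilities together.
        return self.act(self.reason(self.perceive()))
```

In this sketch, swapping the task means rewriting only run() and the metadata, while the capability methods and the platform stay the same.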

V-IRL Benchmark

The essential attributes of V-IRL include its ability to access geographically diverse data derived from real-world sensory input, and its API that facilitates interaction with the Google Maps Platform (GMP). These attributes enable us to develop three V-IRL benchmarks that assess the capabilities of existing vision models on such open-world data distributions.

V-IRL Place: Localization

Motivation: Every day, humans traverse cities, moving between diverse places to fulfill a range of goals, much like the Intentional Explorer agent. We assess how well vision models perform this everyday human activity, place localization, using street view imagery and associated place data.

Setups: We modify the RX-399 agent to traverse polygonal areas while localizing and identifying 20 types of places. We evaluate five prominent open-world detection models: GroundingDINO, GLIP, Owl-ViT, OpenSeeD, and Owl-ViT v2. We also implement a straightforward baseline, CLIP (w/ GLIP proposal), which reclassifies the categories of GLIP proposals with CLIP. Models are evaluated on localization recall, defined as $\frac{N_{tp}}{N_{tp} + N_{fn}}$, where $N_{tp}$ and $N_{fn}$ represent the numbers of correctly localized places and missed places, respectively.
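
The recall defined above can be computed directly from the two counts. The following is a minimal sketch; the function names are illustrative, and averaging per-category recalls to obtain an average recall is an assumption about how the aggregate is formed, not a detail stated here.

```python
def localization_recall(n_tp: int, n_fn: int) -> float:
    """Recall = N_tp / (N_tp + N_fn); returns 0.0 when there are no ground-truth places."""
    total = n_tp + n_fn
    return n_tp / total if total else 0.0


def average_recall(per_category_counts: dict) -> float:
    """Mean of per-category recalls; counts map category -> (n_tp, n_fn).

    Assumption: the aggregate is an unweighted mean over categories.
    """
    recalls = [localization_recall(tp, fn) for tp, fn in per_category_counts.values()]
    return sum(recalls) / len(recalls) if recalls else 0.0
```

For example, a category with 3 correctly localized places and 1 missed place has a recall of 3 / (3 + 1) = 0.75.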

V-IRL Place Localization
Matching between 2D object proposals and street places. We first project the bounding box of each object proposal onto a frustum in 3D space, subject to a radius. We then determine whether any nearby places fall within this frustum and radius. If so, the closest one is assigned as the ground truth for the object proposal; otherwise, the object proposal is regarded as a false positive. When multiple places lie inside the frustum, we take the nearest one as the ground truth, since it would likely block the others in the image.
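
The matching rule above can be sketched in a simplified top-down form, where the frustum reduces to an angular wedge in front of the camera. This is an illustration under stated assumptions, not the benchmark's implementation: it assumes a pinhole camera, flat local coordinates in meters, a known camera heading and horizontal field of view, and it ignores the vertical extent of the box. All names and parameters are hypothetical.

```python
import math


def pixel_to_angle(x_px, image_width, hfov_deg):
    """Horizontal angle (degrees) of a pixel column relative to the optical axis."""
    focal_px = (image_width / 2) / math.tan(math.radians(hfov_deg) / 2)
    return math.degrees(math.atan((x_px - image_width / 2) / focal_px))


def match_proposal_to_place(box_x_min, box_x_max, image_width, hfov_deg,
                            camera_xy, heading_deg, places, radius_m):
    """Return the name of the nearest place inside the wedge, or None (false positive)."""
    left = pixel_to_angle(box_x_min, image_width, hfov_deg)
    right = pixel_to_angle(box_x_max, image_width, hfov_deg)
    best_name, best_dist = None, float("inf")
    for name, (px, py) in places.items():
        dx, dy = px - camera_xy[0], py - camera_xy[1]
        dist = math.hypot(dx, dy)
        if dist > radius_m:
            continue  # outside the search radius
        bearing = math.degrees(math.atan2(dx, dy))  # clockwise from +y ("north")
        relative = (bearing - heading_deg + 180) % 360 - 180  # wrap to [-180, 180)
        # Keep the nearest place whose bearing falls inside the box's angular span.
        if left <= relative <= right and dist < best_dist:
            best_name, best_dist = name, dist
    return best_name
```

With two places at the same bearing, the closer one is returned, mirroring the rule that the nearest place likely occludes the others.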

Results: The following table shows that open-world detectors like GroundingDINO, Owl-ViT, and GLIP are biased towards certain place types such as school, cafe, and convenience store, respectively. In contrast, CLIP (w/ GLIP proposal) can identify a broader spectrum of place types. The detectors' bias is mainly caused by the category bias in object detection datasets, which have a limited vocabulary. Hence, even if detectors like Owl-ViT are initialized with CLIP, their vocabulary space narrows during fine-tuning. These results suggest that cascading category-agnostic object proposals into zero-shot recognizers is promising for "real" open-world localization, especially for categories that are less common in object detection datasets.
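
The cascade discussed above, category-agnostic proposals followed by zero-shot reclassification, can be sketched as a nearest-text-embedding lookup. The sketch assumes each proposal crop and each category prompt has already been embedded into a shared space (for example by an image encoder and a text encoder); the toy vectors and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reclassify_proposals(proposal_embeddings, text_embeddings):
    """Assign each proposal the category whose text embedding is most similar.

    The detector's own category prediction is discarded; only its boxes are kept.
    """
    labels = []
    for emb in proposal_embeddings:
        scores = {cat: cosine(emb, txt) for cat, txt in text_embeddings.items()}
        labels.append(max(scores, key=scores.get))
    return labels
```

Because the category set is just a dictionary of text embeddings, it can be extended without retraining the proposal generator, which is what keeps the vocabulary open.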

V-IRL Place Localization Results
Benchmark results on V-IRL Place Localization. AR 10 and AR 20 denote average recall on a subsampled set of 10 place categories and on all 20 place categories, respectively. More results are available in the paper.
Part of V-IRL Place localization benchmark results via CLIP (w/ GLIP proposal).

V-IRL Place: Recognition and VQA

Motivation: In contrast to the challenging V-IRL place localization task on street view imagery, humans in real life can recognize businesses by taking a closer, place-centric look. We therefore assess existing vision models on two perception tasks based on place-centric images: (i) recognizing specific place types; and (ii) identifying human intentions via Visual Question Answering (VQA), which we name intention VQA.

Setups: For recognition, we assess 10 open-world recognition models on place type recognition from 96 options, using place-centric images (see the imagery illustration below).

Street view imagery (left), sourced from the Google Street View database, is taken from a street-level perspective and encompasses a broad view of the surroundings, including multiple buildings. Place-centric imagery (right), drawn from the Google Place database, focuses predominantly on the specific place, providing a more concentrated view.
Street view imagery (left) vs place-centric imagery (right).

For intention VQA, we evaluate 13 multi-modal large language models (MM-LLMs) on their ability to determine viable human intentions in a four-option multiple-choice VQA. The V-IRL Place VQA process is illustrated in the following image, where the candidate and true choices are generated by GPT-4 given the place types and place names corresponding to the image.

V-IRL Place VQA
Example of the V-IRL Place VQA process.

Results: The following table shows that CLIP (L/14@336px) outperforms even the largest versions of EVA-02-CLIP and SigLIP on the V-IRL Place recognition task, underscoring the high quality of CLIP's data. The bottom of the table shows that LLaVA-NeXT (7B) outperforms its predecessors LLaVA-1.5 and 1.0, but still trails InternVL-1.5 (26B parameters) by over 8%. The closed-source MLLMs GPT-4V and Qwen-VL-Max yield outstanding performance compared to most open-source models. We note that even these top-performing MLLMs still suffer from inconsistency issues during circular evaluation. Moreover, vision models perform better on place VQA than on place-type recognition, suggesting that direct prompts about human intention could be more effective for intention-driven tasks.
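
The inconsistency noted above is what circular evaluation is designed to expose. The sketch below assumes the common definition of circular evaluation for multiple-choice VQA: the question is asked once per circular shift of its options, and it counts as correct only if the model picks the true choice under every shift. The answer_fn callback standing in for an MLLM is hypothetical, as are the example options.

```python
def circular_shifts(options):
    """All circular rotations of the option list, starting with the original order."""
    return [options[i:] + options[:i] for i in range(len(options))]


def passes_circular_eval(question, options, true_choice, answer_fn):
    """True only if answer_fn returns the true choice for every rotation of the options."""
    return all(
        answer_fn(question, shifted) == true_choice
        for shifted in circular_shifts(options)
    )
```

A model that always picks the first option would be right on one of the four rotations of a four-option question, yet it fails under this criterion, which is why position-biased models score lower here than under single-pass accuracy.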

V-IRL Place VQA
Benchmark results on V-IRL Place recognition and V-IRL Place VQA. Green indicates models with increased resolution, while blue denotes model parameter scaling.

V-IRL Vision Language Navigation

Motivation: As discussed in the V-IRL agents section, the Intentional Explorer and Tourist agents require collaboration between vision models and language models to accomplish complex tasks. This motivates us to investigate the performance of vision-language collaboration when environmental information is acquired by visual perception models from real-world images. To this end, we build an embodied task that jointly leverages vision and language models along with the realistic street views in V-IRL: the V-IRL Vision Language Navigation (VLN) benchmark.

Setups: We adapt the Tourist agent implementation and replace its recognition component with the various benchmarked models. These methods are tasked to identify visual landmarks during navigation. Subsequently, GPT-4 predicts the next action according to the recognition results. Navigation instructions are generated using the Local agent.
Four approaches are evaluated to recognize landmarks during navigation: (i) an approximate oracle that searches nearby landmarks; (ii) the zero-shot recognizers CLIP and EVA-02-CLIP; (iii) the multi-modal LLM LLaVA-1.5; and (iv) an OCR model that recognizes potential text in street views, followed by GPT answer parsing.
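
The recognize-then-decide loop described in this setup can be sketched as follows. This is a toy illustration, not the Tourist agent's implementation: the environment is a small lookup table, recognize_landmarks stands in for any of the benchmarked vision models, predict_action stands in for the language model, and all names and the action vocabulary are hypothetical.

```python
def navigate(start, views, transitions, recognize_landmarks, predict_action,
             instruction, max_steps=20):
    """Follow an instruction by alternating landmark recognition and action prediction.

    views:       position -> street view observed at that position
    transitions: (position, action) -> next position
    Returns the list of visited positions.
    """
    position, trajectory = start, [start]
    for _ in range(max_steps):
        landmarks = recognize_landmarks(views[position])  # vision model stand-in
        action = predict_action(instruction, landmarks)   # language model stand-in
        if action == "stop":
            break
        # Unknown moves leave the agent in place rather than raising an error.
        position = transitions.get((position, action), position)
        trajectory.append(position)
    return trajectory
```

Because the language model only sees the recognizer's output, any landmark the recognizer misses or hallucinates propagates directly into the chosen action.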

Results: The following table shows that, with oracle landmark information, powerful LLMs comprehend navigation instructions impressively well and thus make accurate decisions. However, when vision models are used to fetch landmark information from street views, the success rate drops dramatically, suggesting that the perception of vision models is noisy and misguides the LLMs' decision making. Among these recognizers, larger variants of CLIP and EVA-02-CLIP perform better, highlighting the benefits of model scaling. LLaVA-1.5 shows inferior performance even though it uses CLIP (L/14@336px) as its vision encoder, possibly due to the alignment tax incurred during instruction tuning. Further, PP-OCR (+ GPT-3.5) achieves a 28% success rate, signifying that OCR is crucial for visual landmark recognition.

V-IRL VLN
Results on V-IRL VLN-mini. We test various CLIP-based models, an MM-LLM, and an OCR model with GPT postprocessing. We primarily measure navigation success rate (Success). In addition, as navigation success is mainly influenced by the agent's actions at key positions (i.e., start positions, intersections, and stop positions), we also evaluate the arrival ratio (Arr) and reaction accuracy (Reac) for each route. Arr denotes the percentage of key positions reached, while Reac measures the accuracy of the agent's action predictions at these key positions. Full-set results on CLIP and Oracle are available in the paper appendix.
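
The three metrics above can be sketched as simple ratios. The data layout is hypothetical, and computing Reac only over the key positions the agent actually reached is an assumption made for this illustration.

```python
def success_rate(route_outcomes):
    """Fraction of routes that ended in success; outcomes are booleans."""
    return sum(route_outcomes) / len(route_outcomes) if route_outcomes else 0.0


def arrival_ratio(key_positions, visited):
    """Arr: fraction of a route's key positions that appear in the visited trajectory."""
    visited_set = set(visited)
    return sum(p in visited_set for p in key_positions) / len(key_positions)


def reaction_accuracy(expected_actions, predicted_actions):
    """Reac: fraction of reached key positions where the predicted action is correct.

    Both arguments map key position -> action; positions the agent never
    reached are absent from predicted_actions and are skipped (assumption).
    """
    reached = [p for p in expected_actions if p in predicted_actions]
    if not reached:
        return 0.0
    correct = sum(predicted_actions[p] == expected_actions[p] for p in reached)
    return correct / len(reached)
```

Separating Arr from Reac distinguishes an agent that never reaches an intersection from one that reaches it and then turns the wrong way.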

Geographic Diversity

Spanning 12 cities across the globe, our V-IRL benchmarks provide an opportunity to analyze inherent model biases in different regions. As depicted in the following figure, vision models demonstrate subpar performance on all three benchmark tasks in Lagos, Tokyo, Hong Kong, and Buenos Aires. In Lagos, vision models might struggle because its street views differ from those typical of more developed cities (see the street views in the accompanying figures). In Tokyo, Hong Kong, and Buenos Aires, an intriguing observation is that street views primarily contain non-English text. This suggests that existing vision models face challenges with multilingual image data.

City level analysis
City-level visualization of V-IRL benchmark results.

Discussion: Ethics & Privacy

Our platform serves as a tool for AI development and as a crucible for ethical discourse and preparation. As AI is inevitably being integrated into society—e.g., via augmented reality wearables or robots navigating city streets—it is imperative to confront and discuss ethical and privacy concerns now. Unlike these impending real-time systems, the data accessed by V-IRL is "stale" and often preprocessed—providing a controlled environment to study these concerns.

Notably, V-IRL exclusively utilizes preexisting, readily available APIs; it does not capture or make available any previously inaccessible data. Our primary source of street-view imagery, Google Maps, is subject to major privacy-protection measures, including blurring faces and license plates. Moreover, V-IRL complies with the Google Maps Platform license, similarly to notable existing works that also leverage Google's street views.

We believe V-IRL is an invaluable tool for researching bias. As discussed in geographic diversity, V-IRL's global scale provides a lens to study linguistic, cultural, and other geographic biases inherent in models. By using V-IRL to study such questions, we aim to preemptively tackle the ethical dilemmas that will arise with deploying real-time systems rather than being blindsided by them. We hope our work helps spur proactive discussion of future challenges throughout the community.

Conclusion

In this work, we introduce V-IRL, an open-source platform designed to bridge the sensory gap between the digital and physical worlds, enabling AI agents to interact with the real world in a virtual yet realistic environment. Through V-IRL, agents can develop rich sensory grounding and perception, utilizing real geospatial data and street-view imagery. We demonstrate the platform's versatility by creating diverse exemplar agents and developing benchmarks measuring the performance of foundational language and vision models on open-world visual data from across the globe.

This platform opens new avenues for advancing AI capabilities in perception, decision-making, and real-world data interaction. As spatial computing and robotic systems become increasingly prevalent, the demand for and possibilities of AI agents will only grow. From personal assistants to practical applications like urban planning to life-changing tools for the visually impaired, we hope V-IRL helps usher in a new era of perceptually grounded agents.

BibTeX

@inproceedings{yang2024virl,
  title={{V-IRL: Grounding Virtual Intelligence in Real Life}},
  author={Yang, Jihan and Ding, Runyu and Brown, Ellis and Qi, Xiaojuan and Xie, Saining},
  year={2024},
  booktitle={European Conference on Computer Vision},
}