Selective Visual Representations Improve Convergence and Generalization for Embodied AI
Abstract
Embodied-AI models often employ off-the-shelf vision backbones like CLIP to encode their visual observations. Although such general-purpose representations encode rich syntactic and semantic information about the scene, much of this information is often irrelevant to the specific task at hand. This introduces noise within the learning process and distracts the agent's focus from task-relevant visual cues. Inspired by selective attention in humans—the process through which people filter their perception based on their experiences, knowledge, and the task at hand—we introduce a parameter-efficient approach to filter visual stimuli for embodied-AI. Our approach induces a task-conditioned bottleneck using a small learnable codebook module. This codebook is trained jointly to optimize task reward and acts as a task-conditioned selective filter over the visual observation. Our experiments showcase state-of-the-art performance for Object Goal Navigation and Object Displacement across 5 benchmarks: ProcTHOR, ArchitecTHOR, RoboTHOR, AI2-iTHOR, and ManipulaTHOR. The filtered representations produced by the codebook also generalize better and converge faster when adapted to other simulation environments such as Habitat. Our qualitative analyses show that agents explore their environments more effectively and their representations retain task-relevant information like target object recognition while ignoring superfluous information about other objects.
The Codebook Module
A Filtering Mechanism of Visual Representations for Embodied-AI
Conventional embodied-AI frameworks usually employ general-purpose visual backbones like CLIP to extract visual representations from the input. Such representations capture an abundance of detail and a significant amount of task-irrelevant information. For example, to find a specific object in a house, the agent does not need to know about other distractor objects in its view, their colors, materials, or other attributes. This irrelevant information introduces unnecessary noise into the learning process and pulls the agent's focus away from more pertinent visual cues.
We draw on the substantial body of research in cognitive psychology to induce selective, task-specific representations that filter out irrelevant sensory input and retain only the necessary stimuli.
We introduce a compact learnable module that decouples the two objectives in embodied-AI tasks across different parameters in the network (a minimal sketch of such a module follows the list below):
- The input encoders and the codebook focus on extracting task-salient information from the visual input.
- The policy (RNN and actor-critic heads) focuses on decision-making based on this filtered information.
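As a concrete illustration, the sketch below shows one way such a task-conditioned codebook bottleneck could be implemented in PyTorch. The class name, layer sizes, and the soft-attention scoring/aggregation scheme are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class CodebookBottleneck(nn.Module):
    """Task-conditioned codebook bottleneck (illustrative sketch).

    A goal-conditioned visual embedding is scored against a small learnable
    codebook; the output is the softmax-weighted sum of codes, projected back
    to the policy's input size. All sizes here are placeholders.
    """

    def __init__(self, embed_dim=1568, num_codes=256, code_dim=10, out_dim=1568):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, code_dim))
        self.score = nn.Linear(embed_dim, num_codes)  # codebook scoring function
        self.project = nn.Linear(code_dim, out_dim)   # map filtered codes to the policy input size

    def forward(self, task_conditioned_embedding):
        # Soft attention over codes acts as a selective filter on the visual input.
        weights = torch.softmax(self.score(task_conditioned_embedding), dim=-1)
        filtered = weights @ self.codebook            # (batch, code_dim)
        return self.project(filtered)                 # consumed by the RNN policy
```

Because the bottleneck consists only of a scoring layer, a small codebook, and a projection, it adds relatively few parameters on top of the visual backbone, and the codes are trained end-to-end with the task reward.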
Results
Codebook-Based Representations Improve Performance in Embodied-AI
Bottlenecking the task-conditioned embeddings using our codebook module results in significant improvements over the non-bottlenecked representations across a variety of embodied-AI benchmarks. We consider Object Goal Navigation (navigating to find a specific object category in a scene) and Object Displacement (bringing a source object to a destination object using a robotic arm) across 5 benchmarks (ProcTHOR, ArchitecTHOR, RoboTHOR, AI2-iTHOR, and ManipulaTHOR).
Introducing New Metrics for Object Navigation. We present Curvature (κ), which measures how sharply the agent's trajectory bends, averaged over the episode (lower is better, indicating smoother paths). We further introduce Success weighted by Episode Length (SEL), which weights each episode's success by how close the agent's episode length is to the shortest possible episode length (higher is better).
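For reference, one plausible formalization of these metrics is given below, assuming SEL mirrors the standard SPL construction and κ is the mean planar curvature of the agent's (x, y) trajectory; the paper's exact definitions may differ.

```latex
% Assumed formalizations, not verbatim from the paper.
\[
\mathrm{SEL} = \frac{1}{N}\sum_{i=1}^{N} S_i \,\frac{\hat{e}_i}{\max\!\left(e_i,\hat{e}_i\right)},
\qquad
\kappa = \frac{1}{T}\sum_{t=1}^{T}
\frac{\left|\dot{x}_t\,\ddot{y}_t - \dot{y}_t\,\ddot{x}_t\right|}
     {\left(\dot{x}_t^{2} + \dot{y}_t^{2}\right)^{3/2}},
\]
```

where S_i is the binary success indicator, e_i the agent's episode length, ê_i the shortest possible episode length, and the dotted terms are finite-difference velocities and accelerations along the trajectory.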
| Benchmark (Object Goal Navigation) | Model | SR (%) ▲ | EL ▼ | Curvature ▼ | SEL ▲ |
|---|---|---|---|---|---|
| ProcTHOR-10k (test) | EmbCLIP | 67.70 | 182.00 | 0.58 | 36.00 |
| ProcTHOR-10k (test) | EmbCLIP + Codebook (Ours) | 73.72 | 136.00 | 0.23 | 43.69 |
| ArchitecTHOR (0-shot) | EmbCLIP | 55.80 | 222.00 | 0.49 | 20.57 |
| ArchitecTHOR (0-shot) | EmbCLIP + Codebook (Ours) | 58.33 | 174.00 | 0.20 | 28.31 |
| AI2-iTHOR (0-shot) | EmbCLIP | 70.00 | 121.00 | 0.29 | 21.45 |
| AI2-iTHOR (0-shot) | EmbCLIP + Codebook (Ours) | 78.40 | 86.00 | 0.16 | 26.76 |
| RoboTHOR (0-shot) | EmbCLIP | 51.32 | - | - | - |
| RoboTHOR (0-shot) | EmbCLIP + Codebook (Ours) | 55.00 | - | - | - |
| Benchmark (Object Displacement) | Model | PU (%) ▲ | SR (%) ▲ |
|---|---|---|---|
| ManipulaTHOR | m-VOLE | 81.20 | 59.60 |
| ManipulaTHOR | m-VOLE + Codebook (Ours) | 86.00 | 65.10 |
Codebook-Bottlenecked Embedding is Easier to Transfer to New Visual Domains
The codebook-based embedding transfers to new visual domains without exhaustive fine-tuning. Our codebook bottleneck effectively decouples the process of learning salient visual information for the task from the process of decision-making based on this filtered information. Consequently, when faced with a similar task in a new visual domain, the need for adaptation is significantly reduced: only the modules responsible for extracting essential visual cues in the new domain require fine-tuning, while the decision-making modules can remain fixed. We show that our ObjectNav agent trained in the AI2-THOR simulator can effectively adapt to the Habitat simulator (which differs substantially in visual characteristics, lighting, textures, and other environmental factors) by merely fine-tuning a lightweight Adaptation Module.
Lightweight Fine-tuning of the Adaptation Module. When moving to a new visual domain, we fine-tune only a few CNN layers, the action and goal embedders, and the codebook scoring function.
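A minimal sketch of what this lightweight fine-tuning could look like in PyTorch is given below. The agent attribute names (`adapter_cnn`, `goal_embedder`, `action_embedder`, `codebook.score`) are hypothetical placeholders for the modules described above, not identifiers from the released code.

```python
import torch

def configure_for_new_domain(agent, lr=3e-4):
    """Sketch of lightweight domain adaptation (hypothetical attribute names).

    Freeze the policy (RNN + actor-critic heads) and the codebook codes;
    fine-tune only the early CNN layers, the action/goal embedders, and the
    codebook scoring function in the new visual domain (e.g., Habitat).
    """
    for p in agent.parameters():
        p.requires_grad = False  # start fully frozen

    trainable_modules = [
        agent.visual_encoder.adapter_cnn,  # a few CNN layers on top of the backbone
        agent.goal_embedder,
        agent.action_embedder,
        agent.codebook.score,              # codebook scoring function
    ]
    trainable_params = []
    for module in trainable_modules:
        for p in module.parameters():
            p.requires_grad = True
            trainable_params.append(p)
    return torch.optim.Adam(trainable_params, lr=lr)
```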
Codebook Encodes Only the Most Important Information to the Task
We conduct an analysis (through linear probing, GradCAM attention visualization, and nearest-neighbor retrieval) to explore the information encapsulated within our bottlenecked representations after training for the Object Goal Navigation task. The results show that our codebook-bottlenecked representations effectively exclude information related to distracting visual cues and object categories other than the specified goal, concentrating only on the target object and encoding richer information about goal visibility and proximity to the agent.
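For instance, the linear-probing part of this analysis can be sketched as follows; the probing target (binary goal-visibility labels) and the training setup are assumptions for illustration, not the paper's exact protocol.

```python
import torch
import torch.nn as nn

def train_visibility_probe(embeddings, labels, epochs=10, lr=1e-3):
    """Sketch of a linear probe on frozen bottlenecked representations.

    embeddings: frozen codebook-bottlenecked embeddings, shape (N, D).
    labels: LongTensor of shape (N,) with values {0, 1} for goal visibility.
    A single linear layer is trained; its accuracy indicates how much
    goal-visibility information the frozen representation retains.
    """
    probe = nn.Linear(embeddings.shape[1], 2)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(probe(embeddings), labels)
        loss.backward()
        optimizer.step()
    return probe
```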
GradCAM Attention Visualization. While the EmbCLIP ObjectNav agent is distracted by other objects and visual cues even when the target object is visible in the frame, the codebook module helps our agent effectively ignore such distractions and focus only on the object goal.
Nearest-Neighbor Retrieval in the Goal-Conditioned Embedding Space. The four examples show that EmbCLIP-Codebook prioritizes task semantics while EmbCLIP focuses on scene semantics. In the top row, our nearest neighbors are based on object-goal visibility and goal proximity to the agent, whereas EmbCLIP's nearest neighbors are based on the overall semantics of the scene (tables on the left or toilets far away). In the bottom row, our nearest neighbors favor the overall scene layout, whereas EmbCLIP mostly focuses on colors and appearance.
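A retrieval of this kind can be sketched as a simple cosine-similarity lookup over goal-conditioned embeddings; the function below is an illustrative assumption, not the code used to produce the figure.

```python
import torch
import torch.nn.functional as F

def nearest_neighbors(query_embedding, gallery_embeddings, k=4):
    """Return indices of the k most similar frames by cosine similarity.

    query_embedding: goal-conditioned embedding of a query frame, shape (D,).
    gallery_embeddings: embeddings of frames from held-out episodes, shape (N, D).
    """
    similarities = F.cosine_similarity(
        query_embedding.unsqueeze(0), gallery_embeddings, dim=-1
    )
    return similarities.topk(k).indices
```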
Our Agent Explores More Effectively and Travels in Smoother Trajectories
We conduct a quantitative and qualitative analysis to compare agent behavior. The Curvature and Success weighted by Episode Length (SEL) metrics show that our agent explores more effectively and travels along much smoother paths. Excessive rotations and sudden changes in direction can increase energy consumption and the chance of collisions with other objects. The higher SEL achieved by our agent shows that it finds the target object in far fewer steps. The qualitative examples below show that the baseline agent performs many redundant rotations.
Qualitative trajectory comparisons: EmbCLIP-Codebook (Ours) vs. EmbCLIP.
Our agent explores the environment much more effectively and travels in much smoother trajectories, whereas the EmbCLIP baseline agent makes many redundant rotations.
BibTeX
@article{eftekhar2023selective,
title={Selective Visual Representations Improve Convergence and Generalization for Embodied AI},
author={Eftekhar, Ainaz and Zeng, Kuo-Hao and Duan, Jiafei and Farhadi, Ali and Kembhavi, Ani and Krishna, Ranjay},
journal={arXiv preprint arXiv:2311.04193},
year={2023}
}