The academic field of learning instruction-guided visual navigation can be broadly divided into high-level category-specific search and low-level language-guided navigation, depending on the granularity of the language instruction: the former emphasizes the exploration process, while the latter concentrates on following detailed textual commands. Despite the differing focuses of these tasks, the underlying requirements of interpreting instructions, comprehending the surroundings, and inferring action decisions remain consistent. This paper consolidates diverse navigation tasks into a unified and generic framework -- we investigate the core difficulties of sharing general knowledge and exploiting task-specific capabilities in learning navigation, and propose a novel State-Adaptive Mixture of Experts (SAME) model that enables an agent to infer decisions from instructions of varying granularity together with dynamic observations. Powered by SAME, we present a versatile agent that addresses seven navigation tasks simultaneously and outperforms, or performs on par with, task-specific agents.
Figure 1. We consolidate diverse navigation tasks into a unified language-guided navigation framework sorted by language granularity. Previous approaches utilize task-specific designs tailored to address particular types of language instructions, as shown in (a) and (b). In contrast, we propose a versatile system that can interpret and execute arbitrary language instructions as shown in (c).
Figure 2. Illustration of MoE position and experts’ routing methods. SAME routing based on multimodal features from visual observations and language instructions allows the agent to dynamically adapt to environmental visual changes.
- Release SAME finetuning code.
- Release multi-task co-training data.
- Release pretrained model weights.
- Release data preparation scripts.
Note: SAME is simulator-free! You do not need to install Matterport3D simulator or Habitat simulator. The codebase works entirely with pre-computed visual features and connectivity graphs.
- Create a conda environment and install all dependencies:
conda create --name SAME python=3.10
conda activate SAME
pip install -r requirements.txt

That's it! No simulator installation required.
Download the required datasets and features from HuggingFace:
python download.py --data

This script will automatically download all navigation datasets and pre-computed features from HuggingFace: ZGZzz/VersNav, including:
- 9 navigation datasets (R2R, REVERIE, RXR-EN, CVDN, SOON, OBJNAV_MP3D + augmented versions)
- Pre-computed CLIP ViT-B/16 visual features for all simulators
- Connectivity graphs for MatterSim, Habitat-MP3D, and Habitat-HM3D
The data directory should be structured as follows:
data/
├── simulator/
│ ├── connectivity/ # MatterSim connectivity graphs
│ ├── habitat_mp3d_connectivity/ # Habitat MP3D connectivity graphs
│ ├── habitat_hm3d_connectivity/ # Habitat HM3D connectivity graphs
│ ├── mp3d_scanvp_candidates.json
│ ├── habitat_mp3d_scanvp_candidates.json
│ ├── habitat_hm3d_scanvp_candidates.json
│ ├── mp3d_connectivity_graphs.json
│ ├── habitat_mp3d_connectivity_graphs.json
│ └── habitat_hm3d_connectivity_graphs.json
├── features/
│ └── img_features/
│ ├── clip_vit-b16_mp3d_hm3d_gibson.hdf5 # CLIP features for MatterSim & HM3D
│ └── MP3D_habitat_clip_b16.lmdb # CLIP features for MP3D Habitat
├── R2R/
│ ├── R2R_train_mergesim_enc.json
│ ├── R2R_val_train_seen_enc.json
│ ├── R2R_val_seen_enc.json
│ ├── R2R_val_unseen_enc.json
│ ├── R2R_test_enc.json
│ ├── R2R_prevalent_aug_train_enc.json # PREVALENT augmented data
│ └── R2R_scalevln_aug_train_enc.json # ScaleVLN augmented data
├── REVERIE/
│ ├── BBoxes.json
│ ├── REVERIE_train_enc.json
│ ├── REVERIE_val_train_seen_enc.json
│ ├── REVERIE_val_seen_enc.json
│ ├── REVERIE_val_unseen_enc.json
│ ├── REVERIE_test_enc.json
│ └── REVERIE_scalevln_aug_train_enc.jsonl # ScaleVLN augmented data
├── RXR-EN/
│ ├── RXR-EN_train_enc.json
│ ├── RXR-EN_val_seen_enc.json
│ └── RXR-EN_val_unseen_enc.json
├── CVDN/
│ ├── train.json
│ ├── val_seen.json
│ ├── val_unseen.json
│ └── test_cleaned.json
├── SOON/
│ ├── train_enc_pseudo_obj_ade30k_label.jsonl
│ ├── val_unseen_instrs_enc_pseudo_obj_ade30k_label.jsonl
│ ├── val_unseen_house_enc_pseudo_obj_ade30k_label.jsonl
│ └── test_v2_enc.jsonl
└── MP3D/
├── habitatweb/ # Habitat-web human demonstrations for ObjectNav
│ ├── train/
│ └── val_train_seen/
└── v1/
└── val/
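After downloading, a quick way to confirm the files are readable is to open the feature store and one of the connectivity graph files. The sketch below is only a hedged sanity check: the file paths follow the tree above, but it makes no assumption about the internal key layout of the HDF5/JSON files beyond them being listable.

# Hedged sanity check of the downloaded data; only lists top-level entries,
# without assuming a specific key schema inside the files.
import json
import h5py

feat_path = "data/features/img_features/clip_vit-b16_mp3d_hm3d_gibson.hdf5"
conn_path = "data/simulator/mp3d_connectivity_graphs.json"

with h5py.File(feat_path, "r") as f:
    keys = list(f.keys())
    print(f"{len(keys)} top-level feature entries, e.g. {keys[:3]}")

with open(conn_path) as f:
    graphs = json.load(f)
print(f"{len(graphs)} connectivity graph entries loaded")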
Download the ScaleVLN pretrained models from HuggingFace:
# Download all pretrained models
python download.py --pretrain
# Or download specific model
python download.py --pretrain --model attnq # MoE at Attention Query
python download.py --pretrain --model attnkv # MoE at Attention Key-Value
python download.py --pretrain --model ffn     # MoE at Feed-Forward Network

This will download pretrained checkpoints from HuggingFace: ZGZzz/SAME to data/pretrain/:
data/pretrain/
├── Attnq_pretrained_ckpt.pt # Pretrained model with MoE at Attn_q
├── Attnkv_pretrained_ckpt.pt # Pretrained model with MoE at Attn_kv (optional)
└── FFN_pretrained_ckpt.pt # Pretrained model with MoE at FFN (optional)
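To verify that a pretrained checkpoint downloaded correctly, you can load it on CPU and inspect its top-level keys. This is a hedged sketch: the exact structure of the checkpoint dictionary (e.g., whether weights sit under a "model" or "state_dict" key) is not assumed here.

# Minimal checkpoint sanity check (sketch; prints whatever keys are present
# rather than assuming their names).
import torch

ckpt = torch.load("data/pretrain/Attnq_pretrained_ckpt.pt", map_location="cpu")
if isinstance(ckpt, dict):
    print("Top-level keys:", list(ckpt.keys())[:10])
else:
    print("Loaded object of type:", type(ckpt))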
If you want to use our trained model checkpoints for evaluation:
python download.py --checkpoints

This will download trained model checkpoints from HuggingFace: ZGZzz/SAME to data/ckpts/.
To download all data and models at once:
python download.py --data --pretrain --checkpoints

SAME is completely simulator-free! The codebase works entirely with:
- Pre-computed CLIP ViT-B/16 visual features
- Pre-built connectivity graphs
- No need to install or run Matterport3D or Habitat simulators
SAME supports 9 different navigation datasets simultaneously:
Low-Level Language-Guided Navigation:
- R2R (Room-to-Room): Fine-grained instruction following
- R2R-PREVALENT: R2R augmented with speaker-generated instructions
- R2R-ScaleVLN: Augmented R2R with HM3D scenes
- REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments
- REVERIE-ScaleVLN: Augmented REVERIE with HM3D scenes
- RXR-EN: Room-Across-Room Navigation (English split)
High-Level Category-Specific Search:
- CVDN: Cooperative Vision-and-Dialog Navigation
- SOON: Scenario Oriented Object Navigation
- ObjectNav-MP3D: Object Navigation in Matterport3D
Configure dataset sampling ratios in the config file:
task:
source: ['R2R_SCALEVLN', 'R2R_PREVALENT', 'R2R', 'REVERIE_SCALEVLN',
'REVERIE', 'RXR-EN', 'CVDN', 'SOON', 'OBJNAV_MP3D']
ratio: [20, 1, 1, 10, 1, 1, 1, 1, 2] # Sampling ratios
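The ratios above act as relative sampling weights across datasets during multi-task training. The snippet below is an illustrative sketch of that idea (a hypothetical helper, not the repo's actual data loader): each mini-batch is drawn from one dataset with probability proportional to its configured ratio.

# Illustrative ratio-weighted task sampling (hypothetical helper, not the
# repo's loader): draws one dataset per mini-batch in proportion to its ratio.
import random
from collections import Counter

sources = ['R2R_SCALEVLN', 'R2R_PREVALENT', 'R2R', 'REVERIE_SCALEVLN',
           'REVERIE', 'RXR-EN', 'CVDN', 'SOON', 'OBJNAV_MP3D']
ratios = [20, 1, 1, 10, 1, 1, 1, 1, 2]

def sample_dataset():
    # random.choices treats weights as relative, so no normalization is needed
    return random.choices(sources, weights=ratios, k=1)[0]

print(Counter(sample_dataset() for _ in range(10000)))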
SAME supports features from multiple simulators/renderers:
- MatterSim: Original Matterport3D panoramic renderer
  - Used for: R2R, REVERIE, RXR-EN, CVDN, SOON
- Habitat-MP3D: Habitat simulator with MP3D scenes
  - Used for: ObjectNav-MP3D, alternative R2R training
- Habitat-HM3D: Habitat simulator with HM3D scenes
  - Used for: ScaleVLN augmented datasets
Configure simulation environments per dataset:
task:
train_simulation_env:
"R2R": ["mattersim", "mp3d_habitat"] # Can use multiple renderers
"R2R_SCALEVLN": "hm3d_habitat"
"OBJNAV_MP3D": "mp3d_habitat"
eval_simulation_env:
"R2R": "mattersim"
"OBJNAV_MP3D": "mp3d_habitat"SAME introduces task-based MoE routing that adapts to different navigation tasks:
MoE Position Options:
- Attn_q: MoE on attention query projection
- Attn_kv: MoE on attention key-value projections
- FFN: MoE on feed-forward network
Routing Feature Options:
- cls: Text [CLS] token embedding
- mean: Mean-pooled text embeddings
- multi: Fused multimodal (text + visual) embeddings ⭐ Best performance
- task_id: Task embeddings
- task_id_cls: Task embedding + text [CLS]
- task_id_multi: Task embedding + multimodal features
Configuration example:
model:
use_moe_layer: true
moe_type: "Task" # Task-based or Sparse
moe_position: "Attn_q" # Attn_q, Attn_kv, or FFN
task_routing_feature: "multi" # Routing based on multimodal features
num_experts: 8
num_experts_per_tok: 2 # Top-2 expert selection
router_aux_loss_coef: 0.8
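For intuition, here is a minimal, self-contained sketch of top-2 expert routing over a linear projection, driven by a per-sample routing feature (e.g., a fused text+visual state). It illustrates the routing mechanism only; dimensions, class names, and the placement inside the attention layer are assumptions for the example, not the model code in this repository.

# Sketch of top-k expert routing over a projection (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoEProjection(nn.Module):
    def __init__(self, dim=768, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)  # scores experts from the routing feature
        self.top_k = top_k

    def forward(self, x, routing_feature):
        # x: (batch, seq, dim); routing_feature: (batch, dim)
        logits = self.router(routing_feature)            # (batch, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # pick top-k experts per sample
        weights = F.softmax(weights, dim=-1)             # renormalize over selected experts
        out = torch.zeros_like(x)
        for b in range(x.size(0)):
            for w, e in zip(weights[b], idx[b]):
                out[b] = out[b] + w * self.experts[int(e)](x[b])  # weighted expert mixture
        return out

# Usage: route a batch of token features with a per-sample state vector.
moe = TopKMoEProjection()
tokens = torch.randn(4, 10, 768)
state = torch.randn(4, 768)
print(moe(tokens, state).shape)  # torch.Size([4, 10, 768])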
SAME uses OmegaConf for hierarchical configuration management:
- configs/default.yaml: Base configuration with all default settings
- configs/main_multi_q.yaml: Main experiment config (overrides defaults)
- Command-line --options: Runtime overrides (highest priority)
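A minimal sketch of how this layered precedence can be expressed with OmegaConf is shown below; the actual merge logic inside run.py may differ in details.

# Sketch of layered config merging with OmegaConf (illustrative only).
from omegaconf import OmegaConf

defaults = OmegaConf.load("configs/default.yaml")          # lowest priority
experiment = OmegaConf.load("configs/main_multi_q.yaml")   # overrides defaults
cli = OmegaConf.from_dotlist(["training.batch_size=32",    # highest priority
                              "experiment.seed=123"])
cfg = OmegaConf.merge(defaults, experiment, cli)
print(OmegaConf.to_yaml(cfg))

The annotated configuration below lists the main options.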
experiment:
id: "experiment_name" # Experiment identifier
output_dir: "output" # Output directory
data_dir: "../data" # Data root directory
seed: 42 # Random seed
resume_file: null # Checkpoint to resume from
test: false # Test mode (no training)
eval_first: true # Evaluate before training

model:
num_l_layers: 9 # Language encoder layers
num_pano_layers: 2 # Panorama encoder layers
num_x_layers: 4 # Cross-attention layers
graph_sprels: true # Use spatial relations
pretrained_ckpt: "../data/pretrain/Attnkv_pretrained_ckpt.pt"
# MoE Settings
use_moe_layer: true
moe_position: "Attn_q" # or "Attn_kv" or "FFN"
task_routing_feature: "multi"
num_experts: 8
num_experts_per_tok: 2

training:
iters: 500000 # Total training iterations
num_iters_per_epoch: 5000 # Iterations per epoch
batch_size: 16 # Training batch size
val_batch_size: 32 # Validation batch size
learning_rate: 0.00001 # Learning rate
feedback: "sample" # teacher, sample, or argmax
train_alg: "dagger" # imitation or dagger
workers: 4 # DataLoader workers

task:
source: ['R2R_SCALEVLN', 'R2R_PREVALENT', 'R2R', 'REVERIE_SCALEVLN',
'REVERIE', 'RXR-EN', 'CVDN', 'SOON', 'OBJNAV_MP3D']
ratio: [10, 1, 1, 1, 1, 1, 1, 1, 2] # Dataset sampling ratios
# Specify simulator for each dataset
train_simulation_env:
"R2R": ["mattersim", "mp3d_habitat"] # Multiple simulators!
"R2R_SCALEVLN": "hm3d_habitat"
"REVERIE": "mattersim"
"OBJNAV_MP3D": "mp3d_habitat"Override config values via command line:
cd src
python run.py --config_dir configs/main_multi_q.yaml \
--options training.batch_size=32 \
model.num_experts=16 \
experiment.seed=123

Train with the main multi-task configuration:
cd src
python run.py --config_dir configs/main_multi_q.yaml

This will:
- Load the pretrained checkpoint from data/pretrain/Attnq_pretrained_ckpt.pt
- Train on all 9 datasets with the configured sampling ratios
- Evaluate on validation sets before training (eval_first: true)
- Save checkpoints to output/TaskMoE-multi-q/ckpts/
cd src
torchrun \
--nproc_per_node=4 \
--master_port=29500 \
run.py --config_dir configs/main_multi_q.yaml

Customize hyperparameters via command line:
cd src
python run.py --config_dir configs/main_multi_q.yaml \
--options training.batch_size=32 \
training.learning_rate=0.00005 \
experiment.seed=42

Train with MoE at different positions:
# MoE at Attention Key-Value
python run.py --config_dir configs/main_multi_kv.yaml
# MoE at Feed-Forward Network
python run.py --config_dir configs/main_multi_FFN.yaml

Evaluate a trained model on validation/test splits:
cd src
python run.py --config_dir configs/test.yaml \
--options experiment.resume_file=/path/to/checkpoint.pt

Or create a test config file:
# configs/test.yaml
experiment:
id: "test"
test: true
resume_file: "output/TaskMoE-multi-q/ckpts/epoch_xx.pt"
training:
val_batch_size: 32
workers: 4
model:
moe_position: "Attn_q"
pretrained_ckpt: "../data/pretrain/Attnq_pretrained_ckpt.pt"
task_routing_feature: "multi"

Then run:
cd src
python run.py --config_dir configs/test.yaml

SAME evaluates on multiple metrics:
- SR (Success Rate): Percentage of episodes in which the agent stops within the success threshold of the goal
- SPL (Success weighted by Path Length): Success discounted by path efficiency relative to the shortest path
- nDTW (normalized Dynamic Time Warping): Fidelity of the executed path to the ground-truth path
- NE (Navigation Error): Distance from the agent's final position to the goal
- OSR (Oracle Success Rate): Success rate under an oracle stopping rule (the closest point to the goal along the path counts)
Results are saved in the output directory and logged to console.
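For reference, the sketch below shows how the core per-episode metrics are typically computed in VLN-style evaluation, assuming the common 3 m success threshold and precomputed geodesic distances; the repository's evaluation code may differ in details, and nDTW is omitted here.

# Illustrative per-episode metric computation (sketch; 3 m threshold and
# geodesic distances are assumptions following standard VLN practice).
def episode_metrics(final_dist_to_goal, path_length, shortest_path_length,
                    oracle_dist_to_goal, success_threshold=3.0):
    sr = float(final_dist_to_goal <= success_threshold)                       # Success Rate
    spl = sr * shortest_path_length / max(path_length, shortest_path_length)  # SPL
    ne = final_dist_to_goal                                                    # Navigation Error
    osr = float(oracle_dist_to_goal <= success_threshold)                      # Oracle Success Rate
    return {"SR": sr, "SPL": spl, "NE": ne, "OSR": osr}

print(episode_metrics(final_dist_to_goal=1.8, path_length=12.0,
                      shortest_path_length=9.5, oracle_dist_to_goal=1.2))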
We extend our gratitude to Matterport3D for their valuable contributions to the open-source platform and community.
We also acknowledge the significant benefits of using DUET, ScaleVLN and NaviLLM in this work. Our thanks go out to the creators of these outstanding projects.
If you find this work helpful, please consider citing:
@article{zhou2024same,
title={SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts},
author={Gengze Zhou and Yicong Hong and Zun Wang and Chongyang Zhao and Mohit Bansal and Qi Wu},
journal={arXiv preprint arXiv:2412.05552},
year={2024},
}
