Multi-Modal Manipulation via Multi-Modal Policy Consensus
Jiayuan Mao3, Yunzhu Li2, Yilun Du4†, Katherine Driggs-Campbell1†
Retains Sparse But Important Signals
Each modality has its own expert that processes its inputs independently, preventing vision from dominating critical tactile information in contact-rich tasks
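To make the expert-per-modality idea concrete, here is a minimal sketch of fusing independent modality experts at the policy level rather than the feature level. It assumes each expert outputs a Gaussian action proposal and that consensus is a precision-weighted product, which is one simple instantiation; the names (ModalityExpert, consensus) are illustrative placeholders, not our actual implementation.

```python
# Illustrative sketch: one expert per sensor stream, fused at the policy level.
# Assumes Gaussian action proposals combined by a precision-weighted product,
# so an uncertain modality contributes less to the final action.
import torch
import torch.nn as nn

class ModalityExpert(nn.Module):
    """One expert per sensor stream (e.g., vision or tactile)."""
    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, action_dim)
        self.log_var_head = nn.Linear(hidden, action_dim)  # per-dimension uncertainty

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        return self.mean_head(h), self.log_var_head(h)

def consensus(means, log_vars):
    """Precision-weighted fusion of Gaussian action proposals."""
    precisions = [torch.exp(-lv) for lv in log_vars]
    total_precision = sum(precisions)
    fused_mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    return fused_mean, 1.0 / total_precision

# Example: a vision expert and a tactile expert each vote on the next action.
vision_expert = ModalityExpert(obs_dim=64, action_dim=7)
tactile_expert = ModalityExpert(obs_dim=16, action_dim=7)
rgb_feat, tactile_feat = torch.randn(1, 64), torch.randn(1, 16)
m_v, lv_v = vision_expert(rgb_feat)
m_t, lv_t = tactile_expert(tactile_feat)
action, _ = consensus([m_v, m_t], [lv_v, lv_t])
```

Because each expert keeps its own action proposal, a sparse but confident tactile signal can outweigh a dominant but uninformative visual one at the moment of contact.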
Modular Design for Incremental Learning
Train modality-specific policies independently and compose them without retraining the entire system
Robust to Corruption & Perturbations
Maintains performance under sensor corruption, occlusions, and physical perturbations during execution
Why This Approach?
Feature Concatenation (Traditional)
- Vision dominates sparse tactile signals
- Monolithic training—must retrain everything when adding sensors
- Single point of failure
Policy Consensus (Ours)
- Each expert preserves its modality's information
- Modular—compose independently trained policies
- Graceful degradation under sensor failures
What You Gain
Faster Iteration
Add new sensors without retraining from scratch, saving days of compute time
Better Performance
Significantly outperforms feature-concatenation baselines on multi-modal manipulation tasks
Real-World Robustness
Continues working under sensor corruption and environmental perturbations
Is Feature Concatenation the Policy Bottleneck?
We compare a feature-concatenation baseline and factorized MoE fusion against our policy consensus approach
Modality Importance Analysis
Perturbation-based analysis reveals dynamic shifts between modalities across task stages
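As a rough illustration of the perturbation-based probe, the sketch below corrupts one modality's input with noise and measures how far the policy's action shifts; tracking this score over a trajectory shows which modality the policy leans on at each stage. The helper names (modality_importance, policy_fn) and the toy linear policy are hypothetical stand-ins, not our analysis code.

```python
# Illustrative perturbation probe: corrupt one modality's input, measure how
# much the resulting action changes, and use that shift as an importance score.
import torch

@torch.no_grad()
def modality_importance(policy, obs_by_modality, modality, noise_scale=1.0):
    """L2 shift in the policy's action when `modality` is perturbed."""
    baseline = policy(obs_by_modality)
    corrupted = dict(obs_by_modality)
    corrupted[modality] = corrupted[modality] + noise_scale * torch.randn_like(
        corrupted[modality]
    )
    perturbed = policy(corrupted)
    return (perturbed - baseline).norm(dim=-1).mean().item()

# Toy stand-in policy: concatenates modality features and maps them to a 7-D action.
toy_head = torch.nn.Linear(64 + 16, 7)
def policy_fn(obs):
    return toy_head(torch.cat([obs["rgb"], obs["touch"]], dim=-1))

obs = {"rgb": torch.randn(1, 64), "touch": torch.randn(1, 16)}
scores = {m: modality_importance(policy_fn, obs, m) for m in obs}
print(scores)  # larger score => the policy is more sensitive to that modality now
```

Computing these scores at every timestep of a rollout is what surfaces the stage-dependent shifts, for example vision dominating during reaching and touch dominating once contact is made.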
Policy Adaptiveness Under Perturbations
Our policy maintains performance under runtime perturbations, object repositioning, and sensor corruptions
Modular Policy Composition
Independently trained policies can be composed without retraining, enabling incremental integration
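The sketch below shows the incremental workflow in simplified form: previously trained experts are loaded and frozen, a new expert for a newly added sensor is trained only on its own data, and all experts are composed at deployment (here by simply averaging action proposals). The expert definitions and the averaging rule are placeholders for illustration, not our actual composition mechanism.

```python
# Illustrative incremental composition: old experts stay frozen, a new expert
# is trained in isolation, and composition happens only at inference time.
import torch
import torch.nn as nn

def make_expert(obs_dim, action_dim=7):
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))

# 1) Existing experts: load weights and freeze -- they are never retrained.
vision_expert = make_expert(obs_dim=64)
touch_expert = make_expert(obs_dim=16)
for p in list(vision_expert.parameters()) + list(touch_expert.parameters()):
    p.requires_grad_(False)

# 2) New expert for a newly added audio sensor, trained only on its own data.
audio_expert = make_expert(obs_dim=32)
optimizer = torch.optim.Adam(audio_expert.parameters(), lr=1e-3)
audio_obs, demo_action = torch.randn(8, 32), torch.randn(8, 7)
optimizer.zero_grad()
loss = nn.functional.mse_loss(audio_expert(audio_obs), demo_action)
loss.backward()
optimizer.step()

# 3) Deployment: compose all experts' proposals without any joint retraining.
def composed_policy(obs):
    proposals = [vision_expert(obs["rgb"]),
                 touch_expert(obs["touch"]),
                 audio_expert(obs["audio"])]
    return torch.stack(proposals).mean(dim=0)  # simple consensus by averaging

obs = {"rgb": torch.randn(1, 64),
       "touch": torch.randn(1, 16),
       "audio": torch.randn(1, 32)}
action = composed_policy(obs)
```

Because the frozen experts are untouched in step 2, adding a sensor costs only the training of its own expert rather than a full retraining of the system.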
Limitations and Failure Cases
Occasional failures under extreme sensor corruptions