M3
3D-Spatial Multimodal Memory
ICLR 2025
Xueyan Zou, Yuchen Song, Ri-Zhao Qiu, Xuanbin Peng,
Jianglong Ye, Sifei Liu, Xiaolong Wang
UC San Diego, NVIDIA
M3 is a framework for rendering static 3D scenes with RGB and foundation model embeddings, enabling rich spatial and semantic understanding.
Precise Feature Reconstruction: M3 uses Gaussian Memory Attention to reconstruct spatial memory directly in the foundation model's embedding space, avoiding distillation and preserving the source model's embedding space.
Efficient Feature Representation: M3 reduces embedding dimensions from 64 to 16–32 per Gaussian primitive, achieving equal or better performance with at least 50% fewer dimensions for improved efficiency.
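The two points above can be illustrated together: a low-dimensional feature map rendered from the Gaussian primitives attends over a bank of principal scene components (embeddings kept in the foundation model's original space) to reconstruct full-dimensional features. This is a minimal NumPy sketch, not the released implementation; the function name, the fixed random projection standing in for learned weights, and the temperature value are all illustrative assumptions.

```python
import numpy as np

def gaussian_memory_attention(rendered, psc, temperature=0.1):
    """Expand low-dimensional rendered features back into the
    foundation model's embedding space via attention over
    principal scene components.

    rendered: (H, W, d) low-dim features rendered from Gaussian primitives
    psc:      (N, D) principal scene components (foundation-model embeddings)
    Returns:  (H, W, D) features in the original embedding space.
    """
    H, W, d = rendered.shape
    N, D = psc.shape
    # Hypothetical projection to query space; a fixed random matrix
    # stands in for trained weights in this sketch.
    rng = np.random.default_rng(0)
    W_q = rng.standard_normal((d, D)) / np.sqrt(d)
    queries = rendered.reshape(-1, d) @ W_q        # (H*W, D)
    logits = queries @ psc.T / temperature         # (H*W, N)
    logits -= logits.max(axis=1, keepdims=True)    # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    out = attn @ psc                               # convex combination of PSCs
    return out.reshape(H, W, D)
```

Because each output pixel is a convex combination of the stored components, the reconstruction stays inside the source model's embedding space, with only a 16–32-dim feature carried per primitive.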
The best part? Fewer parameters, original embedding space!
Technical Summary Video
Interactive Demo
We propose a new visualization tool that supports streaming 3D scene reconstruction for RGB and foundation model embeddings, with a GPU as the backend.
Real Robot Deployment
We deploy M3 on tabletop manipulation tasks and show demo videos; note that M3 is currently used only for localization and mapping.
Memory to Rendering
In the visualization below, we show the raw feature manifold (blue points) and the memory extracted by M3 (red points). With the proposed M3 method, we apply Gaussian memory attention over the principal scene components to produce the rendered high-resolution feature map (third row).
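One way to picture how a compact memory (red points) can be drawn from the raw feature manifold (blue points) is greedy farthest-point selection: repeatedly pick the embedding farthest from everything already kept. This is a hedged illustration of the idea, not the paper's actual extraction procedure; the function name and selection strategy are assumptions for this sketch.

```python
import numpy as np

def extract_scene_components(features, k):
    """Select k representative embeddings (the "memory") from a raw
    feature manifold via greedy farthest-point sampling.

    features: (n, D) raw per-point embeddings
    Returns:  (k, D) selected representative embeddings.
    """
    chosen = [0]  # seed with an arbitrary first point
    dists = np.linalg.norm(features - features[0], axis=1)
    for _ in range(k - 1):
        idx = int(dists.argmax())          # farthest from current memory
        chosen.append(idx)
        # Track each point's distance to its nearest selected component.
        dists = np.minimum(dists, np.linalg.norm(features - features[idx], axis=1))
    return features[chosen]
```

The selected rows then serve as the component bank that the rendered low-dimensional features attend over.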