SlotSSMs: Slot State Space Models
NeurIPS 2024
Abstract
Recent State Space Models (SSMs) such as S4, S5, and Mamba have shown remarkable computational benefits in long-range temporal dependency modeling. However, in many sequence modeling problems, the underlying process is inherently modular, and it is of interest to have inductive biases that mimic this modular structure. In this paper, we introduce SlotSSMs, a novel framework for incorporating independent mechanisms into SSMs to preserve or encourage separation of information. Unlike conventional SSMs that maintain a monolithic state vector, SlotSSMs maintain the state as a collection of multiple vectors called slots. Crucially, the state transitions are performed independently per slot, with sparse interactions across slots implemented via the bottleneck of self-attention. In experiments, we evaluate our model on object-centric learning, 3D visual reasoning, and long-context video understanding tasks, which involve modeling multiple objects and their long-range temporal dependencies. We find that our proposed design offers substantial performance gains over existing sequence modeling methods.
Method
SlotSSMs vs. existing models. (a) SlotSSMs incorporate modularity through independent state transitions and sparse interactions via self-attention. (b) Traditional SSMs maintain a monolithic state vector for all past information. (c) Multi-slot Transformer-based models offer modularity but at high computational cost. (d) Multi-slot RNN-based models have modular states but cannot parallelize training (red lock). SlotSSMs combine parallelizable training, memory efficiency, and modularity for efficient temporal modeling.
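To make the state-transition structure concrete, the per-slot update can be written schematically as below. The notation is our illustrative reconstruction (a discretized SSM with per-slot parameters), not necessarily the paper's exact parameterization: for slot k in {1, ..., K}, with input s_t^(k) and hidden state h_t^(k),

\[
\begin{aligned}
h_t^{(k)} &= \bar{A}^{(k)}\, h_{t-1}^{(k)} + \bar{B}^{(k)}\, s_t^{(k)} && \text{(independent transition for slot } k\text{)} \\
y_t^{(k)} &= C^{(k)}\, h_t^{(k)} && \text{(per-slot readout)} \\
\big(\tilde{y}_t^{(1)}, \dots, \tilde{y}_t^{(K)}\big) &= \operatorname{SelfAttn}\big(y_t^{(1)}, \dots, y_t^{(K)}\big) && \text{(sparse cross-slot interaction)}
\end{aligned}
\]

The recurrence contains no cross-slot terms; all communication between slots passes through the self-attention step, which acts as the sparse-interaction bottleneck described in the abstract.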
Architecture
SlotSSMs are fully parallelizable sequence models that combine SSMs and Transformers. Each layer comprises three components (a minimal code sketch follows the list):
- Slot Encoder: Utilizes a Transformer to extract compact slot representations from inputs of any size.
- SlotSSM: Independently updates these slots over time using separate state transitions.
- Slot Mixer: Introduces inter-slot interactions through self-attention mechanisms.
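Below is a minimal PyTorch sketch of a single SlotSSM layer, intended only to illustrate the three components above. It is our reconstruction under assumptions, not the authors' released code: the class name SlotSSMLayer, all shapes, and all hyperparameters are ours, and a simple diagonal-decay recurrence stands in for an S4/S5/Mamba-style kernel.

    # Illustrative sketch of one SlotSSM layer (our reconstruction, not official code).
    import torch
    import torch.nn as nn

    class SlotSSMLayer(nn.Module):
        def __init__(self, num_slots=4, slot_dim=64, num_heads=4):
            super().__init__()
            self.num_slots = num_slots
            # Learned slot queries used by the Slot Encoder.
            self.slots = nn.Parameter(torch.randn(num_slots, slot_dim) * 0.02)
            # Slot Encoder: slots cross-attend to each frame's input tokens.
            self.encoder_attn = nn.MultiheadAttention(slot_dim, num_heads, batch_first=True)
            # Per-slot SSM parameters: a diagonal decay per slot (stand-in for S4/S5/Mamba).
            self.log_decay = nn.Parameter(torch.zeros(num_slots, slot_dim))
            self.input_proj = nn.Linear(slot_dim, slot_dim)
            # Slot Mixer: self-attention across the K slots at every time step.
            self.mixer_attn = nn.MultiheadAttention(slot_dim, num_heads, batch_first=True)

        def forward(self, tokens):
            # tokens: (batch, time, num_tokens, slot_dim), pre-embedded inputs.
            B, T, N, D = tokens.shape
            K = self.num_slots
            kv = tokens.reshape(B * T, N, D)
            queries = self.slots.unsqueeze(0).expand(B * T, -1, -1)
            # 1) Slot Encoder: compress each frame's tokens into K slot vectors.
            slots, _ = self.encoder_attn(queries, kv, kv)
            slots = slots.reshape(B, T, K, D)
            # 2) SlotSSM: linear recurrence applied independently per slot.
            #    (A real implementation would use a parallel scan; a loop is clearer.)
            decay = torch.sigmoid(self.log_decay)        # (K, D): one kernel per slot
            u = self.input_proj(slots)                   # slot-wise input drive
            h = tokens.new_zeros(B, K, D)
            states = []
            for t in range(T):
                h = decay * h + (1.0 - decay) * u[:, t]  # no cross-slot terms here
                states.append(h)
            states = torch.stack(states, dim=1)          # (B, T, K, D)
            # 3) Slot Mixer: sparse cross-slot interaction via self-attention.
            flat = states.reshape(B * T, K, D)
            mixed, _ = self.mixer_attn(flat, flat, flat)
            return mixed.reshape(B, T, K, D)

    layer = SlotSSMLayer()
    video = torch.randn(2, 8, 16, 64)   # 2 clips, 8 frames, 16 tokens per frame
    print(layer(video).shape)           # torch.Size([2, 8, 4, 64])

The design invariant to note is that the recurrence never mixes slots: cross-slot communication happens only inside the Slot Mixer's self-attention, and a production implementation would replace the time loop with a parallel scan to retain the parallel-training advantage shown in the figure above.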
Multi-Object Video Prediction
On the multi-object video prediction task, SlotSSMs deliver significant gains over single-state baselines and performance comparable to multi-slot Transformer models, highlighting the necessity of modular state representations in video modeling.
Long-Context Reasoning
In long-context reasoning tasks, SlotSSMs outperform existing models in both prediction accuracy and computational efficiency.
Unsupervised Object-Centric Learning
We propose OC-SlotSSMs, a variant for unsupervised object-centric representation learning. OC-SlotSSMs outperform existing methods in both unsupervised object segmentation and downstream property prediction.
3D Visual Reasoning
We evaluate SlotSSMs on the CATER dataset, a challenging 3D visual reasoning benchmark. OC-SlotSSMs achieve superior performance in both the direct-training and pre-training + fine-tuning settings.
Emergent Modularity in Real-World Videos
[Video panels: results on the TikTok, Waymo, and UT Egocentric datasets.]
We apply OC-SlotSSMs to a depth estimation task on real-world datasets. SlotSSMs are capable of exploiting modular representations to understand scene structure in real-world videos, without explicit segmentation supervision. The videos show slot decomposition alongside depth estimation.
Note: In this task, our goal is not to surpass existing depth estimation models but to use this task, manageable with our lab resources, to showcase the emergent modularity of SlotSSMs in real-world video processing. For the TikTok dataset, we manually changed the colors of two background slots to grey for a more aesthetically pleasing visualization.
BibTeX
@inproceedings{jiang2024slot,
  title     = {Slot State Space Models},
  author    = {Jiang, Jindong and Deng, Fei and Singh, Gautam and Lee, Minseung and Ahn, Sungjin},
  booktitle = {Advances in Neural Information Processing Systems},
  volume    = {37},
  pages     = {11602--11633},
  year      = {2024},
  url       = {https://proceedings.neurips.cc/paper_files/paper/2024/file/158ac5698e36a01ee5ca9e6732685b34-Paper-Conference.pdf}
}