SlotSSMs: Slot State Space Models
NeurIPS 2024
Abstract
Recent State Space Models (SSMs) such as S4, S5, and Mamba have shown remarkable computational benefits in long-range temporal dependency modeling. However, in many sequence modeling problems, the underlying process is inherently modular, and it is of interest to have inductive biases that mimic this modular structure. In this paper, we introduce SlotSSMs, a novel framework for incorporating independent mechanisms into SSMs to preserve or encourage separation of information. Unlike conventional SSMs that maintain a monolithic state vector, SlotSSMs maintain the state as a collection of multiple vectors called slots. Crucially, the state transitions are performed independently per slot, with sparse interactions across slots implemented via the bottleneck of self-attention. In experiments, we evaluate our model on object-centric learning, 3D visual reasoning, and long-context video understanding tasks, which involve modeling multiple objects and their long-range temporal dependencies. We find that our proposed design offers substantial performance gains over existing sequence modeling methods.
Method
SlotSSMs vs. existing models. (a) SlotSSMs incorporate modularity through independent state transitions and sparse interactions via self-attention. (b) Traditional SSMs maintain a monolithic state vector for all past information. (c) Multi-slot Transformer-based models offer modularity but at high computational cost. (d) Multi-slot RNN-based models have modular states but cannot parallelize training (red lock). SlotSSMs combine parallelizable training, memory efficiency, and modularity for efficient temporal modeling.
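To make the state-transition structure concrete, the per-slot update can be written schematically as below. The notation is our illustrative reconstruction (a discretized SSM with per-slot parameters), not necessarily the paper's exact parameterization: for slot k in {1, ..., K}, with input s_t^(k) and hidden state h_t^(k),

\[
\begin{aligned}
h_t^{(k)} &= \bar{A}^{(k)}\, h_{t-1}^{(k)} + \bar{B}^{(k)}\, s_t^{(k)} && \text{(independent transition for slot } k\text{)} \\
y_t^{(k)} &= C^{(k)}\, h_t^{(k)} && \text{(per-slot readout)} \\
\big(\tilde{y}_t^{(1)}, \dots, \tilde{y}_t^{(K)}\big) &= \operatorname{SelfAttn}\big(y_t^{(1)}, \dots, y_t^{(K)}\big) && \text{(sparse cross-slot interaction)}
\end{aligned}
\]

The recurrence contains no cross-slot terms; all communication between slots passes through the self-attention step, which acts as the sparse-interaction bottleneck described in the abstract.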
Architecture
SlotSSMs are fully parallelizable sequence models that combine SSMs and Transformers. Each layer comprises three components (a minimal code sketch follows the list):
- Slot Encoder: Utilizes a Transformer to extract compact slot representations from inputs of any size.
- SlotSSM: Independently updates these slots over time using separate state transitions.
- Slot Mixer: Introduces inter-slot interactions through self-attention mechanisms.
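Below is a minimal PyTorch sketch of a single SlotSSM layer, intended only to illustrate the three components above. It is our reconstruction under assumptions, not the authors' released code: the class name SlotSSMLayer, all shapes, and all hyperparameters are ours, and a simple diagonal-decay recurrence stands in for an S4/S5/Mamba-style kernel.

    # Illustrative sketch of one SlotSSM layer (our reconstruction, not official code).
    import torch
    import torch.nn as nn

    class SlotSSMLayer(nn.Module):
        def __init__(self, num_slots=4, slot_dim=64, num_heads=4):
            super().__init__()
            self.num_slots = num_slots
            # Learned slot queries used by the Slot Encoder.
            self.slots = nn.Parameter(torch.randn(num_slots, slot_dim) * 0.02)
            # Slot Encoder: slots cross-attend to each frame's input tokens.
            self.encoder_attn = nn.MultiheadAttention(slot_dim, num_heads, batch_first=True)
            # Per-slot SSM parameters: a diagonal decay per slot (stand-in for S4/S5/Mamba).
            self.log_decay = nn.Parameter(torch.zeros(num_slots, slot_dim))
            self.input_proj = nn.Linear(slot_dim, slot_dim)
            # Slot Mixer: self-attention across the K slots at every time step.
            self.mixer_attn = nn.MultiheadAttention(slot_dim, num_heads, batch_first=True)

        def forward(self, tokens):
            # tokens: (batch, time, num_tokens, slot_dim), pre-embedded inputs.
            B, T, N, D = tokens.shape
            K = self.num_slots
            kv = tokens.reshape(B * T, N, D)
            queries = self.slots.unsqueeze(0).expand(B * T, -1, -1)
            # 1) Slot Encoder: compress each frame's tokens into K slot vectors.
            slots, _ = self.encoder_attn(queries, kv, kv)
            slots = slots.reshape(B, T, K, D)
            # 2) SlotSSM: linear recurrence applied independently per slot.
            #    (A real implementation would use a parallel scan; a loop is clearer.)
            decay = torch.sigmoid(self.log_decay)        # (K, D): one kernel per slot
            u = self.input_proj(slots)                   # slot-wise input drive
            h = tokens.new_zeros(B, K, D)
            states = []
            for t in range(T):
                h = decay * h + (1.0 - decay) * u[:, t]  # no cross-slot terms here
                states.append(h)
            states = torch.stack(states, dim=1)          # (B, T, K, D)
            # 3) Slot Mixer: sparse cross-slot interaction via self-attention.
            flat = states.reshape(B * T, K, D)
            mixed, _ = self.mixer_attn(flat, flat, flat)
            return mixed.reshape(B, T, K, D)

    layer = SlotSSMLayer()
    video = torch.randn(2, 8, 16, 64)   # 2 clips, 8 frames, 16 tokens per frame
    print(layer(video).shape)           # torch.Size([2, 8, 4, 64])

The design invariant to note is that the recurrence never mixes slots: cross-slot communication happens only inside the Slot Mixer's self-attention, and a production implementation would replace the time loop with a parallel scan to retain the parallel-training advantage shown in the figure above.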
Multi-Object Video Prediction
On the multi-object video prediction task, SlotSSMs deliver significant gains over single-state baselines and performance comparable to multi-slot Transformer models, highlighting the necessity of modular state representations in video modeling.
Long-Context Reasoning
In long-context reasoning tasks, SlotSSMs outperform existing models in both prediction accuracy and computational efficiency.
Unsupervised Object-Centric Learning
We propose OC-SlotSSMs, a variant for unsupervised object-centric representation learning. OC-SlotSSMs outperform existing methods in both unsupervised object segmentation and downstream property prediction.
3D Visual Reasoning
We evaluate SlotSSMs on the CATER dataset, a challenging 3D visual reasoning benchmark. OC-SlotSSMs achieve superior performance in both the direct-training and pre-training + fine-tuning settings.
Emergent Modularity in Real-World Videos
[Video panels: results on the TikTok, Waymo, and UT Egocentric datasets.]
We apply OC-SlotSSMs to a depth estimation task on real-world datasets. SlotSSMs are capable of exploiting modular representations to understand scene structure in real-world videos, without explicit segmentation supervision. The videos show slot decomposition alongside depth estimation.
Note: In this task, our goal is not to surpass existing depth estimation models but to use this task, manageable with our lab resources, to showcase the emergent modularity of SlotSSMs in real-world video processing. For the TikTok dataset, we manually changed the colors of two background slots to grey for a more aesthetically pleasing visualization.
BibTeX
@inproceedings{jiang2024slot,
  title     = {Slot State Space Models},
  author    = {Jiang, Jindong and Deng, Fei and Singh, Gautam and Lee, Minseung and Ahn, Sungjin},
  booktitle = {Advances in Neural Information Processing Systems},
  volume    = {37},
  pages     = {11602--11633},
  year      = {2024},
  url       = {https://proceedings.neurips.cc/paper_files/paper/2024/file/158ac5698e36a01ee5ca9e6732685b34-Paper-Conference.pdf}
}