Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
We are excited to present Concerto 🎶, a superior model for 3D representation learning that simulates the human concept-learning process for spatial cognition by combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding.
Video
We present PCA visualizations of Concerto's inference on point cloud and video data, comparing the raw input (left) to the resulting representation (right). By employing joint 2D-3D self-supervised learning, Concerto effectively unlocks the potential of large-scale unlabeled point cloud datasets. Since current feed-forward reconstruction methods can lift videos into point clouds with spatial prior knowledge, Concerto also performs strongly on video-lifted point clouds, paving the way toward lifted spatial intelligence. With oceans of unlabeled video data available online, Concerto opens up oceans of opportunities.
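As a concrete reference for how video can feed into this pipeline, below is a minimal sketch of lifting video frames into a colored point cloud by unprojecting per-frame depth with camera parameters. This is an illustrative assumption about the lifting step, not Concerto's actual preprocessing; the inputs (`depths`, `intrinsics`, `poses`, `rgbs`) are hypothetical outputs of a feed-forward reconstruction model.

```python
# Minimal sketch (not Concerto's preprocessing): unproject per-frame depth maps
# from a feed-forward reconstruction model into a fused, colored point cloud
# that a 3D backbone such as Concerto could consume.
import numpy as np

def unproject_frame(depth, intrinsic, cam_to_world, rgb):
    """Lift one H x W depth map to world-space points with colors."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0
    # Pixel -> camera coordinates under the pinhole model.
    fx, fy = intrinsic[0, 0], intrinsic[1, 1]
    cx, cy = intrinsic[0, 2], intrinsic[1, 2]
    x = (u.reshape(-1) - cx) / fx * z
    y = (v.reshape(-1) - cy) / fy * z
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)[valid]
    # Camera -> world coordinates with the 4x4 pose matrix.
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    colors = rgb.reshape(-1, 3)[valid]
    return pts_world, colors

def lift_video(depths, intrinsics, poses, rgbs):
    """Fuse all frames into a single colored point cloud."""
    points, colors = zip(*(unproject_frame(d, k, p, c)
                           for d, k, p, c in zip(depths, intrinsics, poses, rgbs)))
    return np.concatenate(points), np.concatenate(colors)
```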
Gallery
We show interactive visualizations for both point cloud and video data. The raw RGB-colored point cloud is on the left; the PCA visualization of the processed representation is on the right.
(Mouse wheel to zoom in/out, drag to rotate, ctrl + drag to pan)
For more examples, use our inference code on GitHub to generate your own visualizations! Check it out
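If you want to reproduce the style of visualization shown above, the following is a minimal sketch of the PCA-to-RGB mapping, assuming `features` is an (N, C) array of per-point representations; the actual inference and visualization code lives in the GitHub repository.

```python
# Minimal sketch of the PCA-to-RGB trick used for feature visualizations.
import numpy as np

def pca_colorize(features, eps=1e-6):
    """Project high-dimensional per-point features to 3 principal components as RGB."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Top-3 principal directions via SVD of the centered feature matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T                      # (N, 3)
    # Normalize each component to [0, 1] so it can be displayed as a color.
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / (hi - lo + eps)
```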
Abstract
Beyond Single Modality
- Starting from human concept learning:
Our inspiration is rooted in how humans learn abstract concepts: through multisensory synergy. Consider the example of an apple (as illustrated on the right). Our understanding of it is formed by repeatedly seeing, touching, and tasting apples, allowing us to internalize its geometry, texture, and semantic meaning in a unified, predictive way (top right). Yet once such a representation is formed, it can be evoked from just a single modality: seeing an image of an apple can vividly recall its weight and texture (bottom right). This ability to retrieve rich, structured knowledge from partial sensory input underscores the importance of learning modality-agnostic representations that are both unified and predictive.
- Towards a Superior Representation by Joint Multi-Modal Learning:
Inspired by this principle, we believe a similar synergy can be leveraged between self-supervised learning on 2D images and 3D point clouds. We begin with a pilot experiment: fusing self-supervised features from the image model DINOv2 and the point cloud model Sonata, and benchmarking the 2D, 3D, and fused representations via linear probing on ScanNet (detailed implementations can be found in our paper). Notably, this naive combination outperforms both individual modalities, suggesting the presence of complementary information and hinting at a richer representational space if the synergy that emerges when the modalities are learned together is fully captured.
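For illustration, a minimal sketch of this fusion-and-linear-probing setup is shown below. It assumes per-point DINOv2 features have already been projected from images onto the point cloud and aligned with the Sonata features; the feature dimensions and loss shown are placeholders, not the exact protocol from the paper.

```python
# Hedged sketch of the fusion-plus-linear-probing pilot (not the paper's exact
# protocol): pre-extracted, point-aligned DINOv2 and Sonata features are
# concatenated and probed with a single linear layer on ScanNet semantic labels.
import torch
import torch.nn as nn

NUM_CLASSES = 20  # ScanNet semantic segmentation benchmark

class LinearProbe(nn.Module):
    def __init__(self, dim_2d, dim_3d, num_classes=NUM_CLASSES):
        super().__init__()
        self.head = nn.Linear(dim_2d + dim_3d, num_classes)

    def forward(self, feat_2d, feat_3d):
        # Freeze-and-fuse: both backbones stay frozen, only this layer is trained.
        fused = torch.cat([feat_2d, feat_3d], dim=-1)
        return self.head(fused)

# Usage with hypothetical pre-extracted per-point features:
# probe = LinearProbe(dim_2d=1024, dim_3d=512)
# logits = probe(dino_feats_per_point, sonata_feats_per_point)
# loss = nn.functional.cross_entropy(logits, semantic_labels)
```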
Pipeline of Concerto
Concerto simulates human multisensory synergy by coupling
(a) intra-modal self-distillation on 3D point clouds to progressively refine its internal spatial representations, and
(b) cross-modal joint embedding prediction that aligns point features with corresponding image patch features using camera parameters.
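To make the cross-modal step in (b) concrete, here is a minimal sketch of pairing 3D points with 2D patch features via camera projection and aligning them with a cosine objective. The function names, patch size, and loss form are illustrative assumptions, not Concerto's exact implementation, and the point features are assumed to already be projected to the patch feature dimension.

```python
# Hedged sketch of cross-modal joint embedding prediction: project points into
# the image with camera parameters, gather the patch features their projections
# fall into, and train point features to predict them.
import torch
import torch.nn.functional as F

def project_points(points, intrinsic, world_to_cam):
    """Project (N, 3) world-space points to pixel coordinates and depth."""
    homo = torch.cat([points, torch.ones_like(points[:, :1])], dim=1)   # (N, 4)
    cam = (world_to_cam @ homo.T).T[:, :3]                              # (N, 3)
    z = cam[:, 2]
    uv = (intrinsic @ cam.T).T[:, :2] / z.clamp(min=1e-6).unsqueeze(1)
    return uv, z

def cross_modal_loss(point_feats, points, patch_feats, intrinsic,
                     world_to_cam, image_size, patch_size=14):
    """Align each visible point's feature with its corresponding patch feature."""
    h, w = image_size
    uv, z = project_points(points, intrinsic, world_to_cam)
    visible = (z > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
                      & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    # Map pixel coordinates to patch-grid indices (row-major flattening).
    grid_h, grid_w = h // patch_size, w // patch_size
    col = (uv[visible, 0] / patch_size).long().clamp(max=grid_w - 1)
    row = (uv[visible, 1] / patch_size).long().clamp(max=grid_h - 1)
    target = patch_feats[row * grid_w + col]   # (M, C) matched patch features
    pred = point_feats[visible]                # (M, C) predicted from the 3D side
    # Cosine-style alignment objective (illustrative choice).
    return 1 - F.cosine_similarity(pred, target, dim=-1).mean()
```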
Applications
Language Probing: We demonstrate Concerto's ability to form concepts that align with human language, paving the way for future exploration of alignment with text-based semantic spaces. With linear probing, we translate Concerto's representations into a language embedding space. Below are visualizations of object localization obtained by querying Concerto with specific words.
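As an illustration of how such a probe could work, below is a minimal sketch that linearly maps per-point features into a text embedding space (e.g., CLIP's) and ranks points against the embedding of a query word; the choice of CLIP-style embeddings and cosine similarity is an assumption for illustration, not necessarily the setup used in the paper.

```python
# Hedged sketch of language probing: a linear layer maps frozen per-point
# features into a text embedding space, and an object is located by ranking
# points against the embedding of a query word.
import torch
import torch.nn as nn

class LanguageProbe(nn.Module):
    def __init__(self, point_dim, text_dim):
        super().__init__()
        # Trained with point features frozen, supervised by paired text targets.
        self.to_text = nn.Linear(point_dim, text_dim)

    def forward(self, point_feats):
        return nn.functional.normalize(self.to_text(point_feats), dim=-1)

def locate(point_feats_in_text_space, word_embedding, top_k=1000):
    """Return indices of the points most similar to the query word."""
    word = nn.functional.normalize(word_embedding, dim=-1)
    scores = point_feats_in_text_space @ word      # (N,) cosine similarities
    return scores.topk(min(top_k, scores.numel())).indices
```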
More Details?
Read the Paper
Citation
@inproceedings{zhang2025concerto,
  title={Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations},
  author={Zhang, Yujia and Wu, Xiaoyang and Lao, Yixing and Wang, Chengyao and Tian, Zhuotao and Wang, Naiyan and Zhao, Hengshuang},
  booktitle={NeurIPS},
  year={2025}
}