Depth Anything 3
Recovering the Visual Space from Any Views
Bingyi Kang*†
†Project Lead, *Equal Contribution
TL;DR:
Depth Anything 3 recovers the space with superior geometry and 3DGS rendering from any visual inputs.
The secret? No complex tasks! No special architecture!
Just a single, plain transformer trained with a depth-ray representation.
Abstract
We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINOv2 encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 35.7% in camera pose accuracy and 23.6% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
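The depth-ray target pairs each pixel with a depth along its camera ray, which together pin down 3D geometry directly. A minimal numpy sketch of this unprojection, assuming a pinhole camera; the function name and values are illustrative, not DA3's actual API:

```python
import numpy as np

def unproject_depth_rays(depth, K):
    """Lift a per-pixel depth map to 3D points along camera rays.

    depth: (H, W) per-pixel depths; K: (3, 3) pinhole intrinsics.
    Returns (H, W, 3) points in the camera frame.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T        # un-normalized ray directions
    return rays * depth[..., None]         # scale each ray by its depth

# Toy example: a fronto-parallel plane at depth 2 m.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0,   0.0,  1.0]])
pts = unproject_depth_rays(np.full((48, 64), 2.0), K)
```

Since every ray here has unit z-component, the z-coordinate of each recovered point equals its depth, and the point under the principal point lies on the optical axis.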
Abilities
Video Reconstruction
DA3 recovers the visual space from any number of views, from a single image to many. This demo illustrates DA3's ability to recover the visual space from a challenging video.
SLAM for Large-Scale Scenes
Accurate visual geometry estimation improves SLAM performance. Quantitative results show that simply replacing VGGT with DA3 in VGGT-Long (yielding DA3-Long) significantly reduces drift in large-scale environments, even outperforming COLMAP, which takes more than 48 hours to complete.
Feed-Forward 3D Gaussians Estimation
By freezing the entire backbone and training only a DPT head to predict 3DGS parameters, our model achieves strong, generalizable novel view synthesis.
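The freeze-backbone-train-head recipe can be sketched in a few lines of PyTorch. The modules below are toy stand-ins (`backbone` and `gs_head` are illustrative placeholders, not DA3's actual networks); only the frozen/trainable split mirrors the setup described above:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real networks (illustrative only).
backbone = nn.Linear(16, 16)   # placeholder for the frozen transformer
gs_head = nn.Linear(16, 8)     # placeholder head predicting 3DGS params

# Freeze the backbone: only the head receives gradient updates.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

opt = torch.optim.Adam(gs_head.parameters(), lr=1e-3)

x = torch.randn(4, 16)
target = torch.randn(4, 8)
frozen_before = backbone.weight.detach().clone()
head_before = gs_head.weight.detach().clone()

with torch.no_grad():
    feats = backbone(x)                     # frozen features
loss = nn.functional.mse_loss(gs_head(feats), target)
opt.zero_grad()
loss.backward()
opt.step()                                  # updates the head only
```

After the step, the backbone weights are untouched while the head's have moved, which is exactly the property that lets the head inherit the backbone's generalization.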
Spatial Perception from Multiple Cameras
Given several images from different viewpoints on a vehicle (even without overlap), DA3 estimates stable, fusible depth maps, enhancing an autonomous vehicle's environmental understanding.
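"Fusible" means the per-camera depth maps can be lifted to 3D and merged in one shared frame. A minimal numpy sketch, assuming known camera-to-vehicle extrinsics; the two-camera rig and its transforms are toy values, not a real calibration:

```python
import numpy as np

def to_vehicle_frame(points_cam, T_vehicle_cam):
    """Map points from one camera's frame into a shared vehicle frame.

    points_cam: (N, 3) points in the camera frame.
    T_vehicle_cam: (4, 4) camera-to-vehicle rigid transform.
    """
    pts_h = np.hstack([points_cam, np.ones((len(points_cam), 1))])
    return (pts_h @ T_vehicle_cam.T)[:, :3]

# Toy rig: a front camera aligned with the vehicle, and a rear camera
# rotated 180 degrees about the vertical axis.
T_front = np.eye(4)
T_rear = np.eye(4)
T_rear[:3, :3] = np.diag([-1.0, 1.0, -1.0])

front_pts = np.array([[0.0, 0.0, 5.0]])   # 5 m ahead of the front camera
rear_pts = np.array([[0.0, 0.0, 5.0]])    # 5 m behind the vehicle

fused = np.vstack([
    to_vehicle_frame(front_pts, T_front),
    to_vehicle_frame(rear_pts, T_rear),
])
```

When the per-camera depths are consistent in scale, the lifted points land in a single coherent point cloud (here, one point 5 m ahead and one 5 m behind the vehicle).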
Interactive Examples
Comparison
Awesome DA3 Projects
A community-curated list of Depth Anything 3 integrations across 3D tools, creative pipelines, robotics, and web/VR viewers, including but not limited to those shown here. You are welcome to submit your DA3-based project via a PR to our GitHub repository; we will review and feature it where applicable.
Citation
@article{depthanything3,
title={Depth Anything 3: recovering the visual space from any views},
author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
journal={arXiv preprint arXiv:2511.10647},
year={2025}
}