E3D-Bench: A Benchmark for
End-to-End 3D Geometric Foundation Models
Yan Wang4, Boris Ivanovic4, Marco Pavone4,5, Chen Chen3, Zhangyang Wang1, Zhiwen Fan1
3University of Central Florida 4NVIDIA Research 5Stanford University
Abstract
Spatial intelligence, encompassing 3D reconstruction, perception, and reasoning, is fundamental to applications such as robotics, aerial imaging, and extended reality. A key enabler is the real-time, accurate estimation of core 3D attributes (camera parameters, point clouds, depth maps, and 3D point tracks) from unstructured or streaming imagery. Inspired by the success of large foundation models in language and 2D vision, a new class of end-to-end 3D geometric foundation models (GFMs) has emerged, directly predicting dense 3D representations in a single feed-forward pass, eliminating the need for slow or unavailable precomputed camera parameters. Since late 2023, the field has exploded with diverse variants. With the rapid proliferation of 3D GFMs, we ask:
Effectiveness
| Method | DTU ACC ↓ | DTU Comp ↓ | DTU NC ↑ | 7-Scenes ACC ↓ | 7-Scenes Comp ↓ | 7-Scenes NC ↑ | NRGBD ACC ↓ | NRGBD Comp ↓ | NRGBD NC ↑ | ScanNet ACC ↓ | ScanNet Comp ↓ | ScanNet NC ↑ | TUM-RGBD ACC ↓ | TUM-RGBD Comp ↓ | TUM-RGBD NC ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DUSt3R/LSM | 1.731 | 1.936 | 0.786 | 0.146 | 0.181 | 0.744 | 0.144 | 0.154 | 0.867 | 0.474 | 0.420 | 0.714 | 1.108 | 0.746 | 0.724 |
| MASt3R | 1.895 | 2.003 | 0.788 | 0.262 | 0.254 | 0.732 | 0.113 | 0.102 | 0.810 | 0.467 | 0.389 | 0.701 | 0.738 | 0.747 | 0.739 |
| Spann3R | 6.275 | 5.460 | 0.705 | 0.255 | 0.188 | 0.653 | 0.262 | 0.262 | 0.628 | 0.487 | 0.408 | 0.617 | 1.561 | 1.002 | 0.621 |
| FLARE | 3.406 | 3.950 | 0.491 | 0.152 | 0.154 | 0.704 | 0.060 | 0.056 | 0.839 | 0.357 | 0.302 | 0.561 | 0.515 | 0.486 | 0.677 |
| CUT3R | 6.885 | 5.022 | 0.727 | 0.118 | 0.142 | 0.717 | 0.104 | 0.078 | 0.828 | 0.260 | 0.238 | 0.692 | 0.587 | 0.553 | 0.683 |
| VGGT | 2.716 | 2.301 | 0.765 | 0.077 | 0.080 | 0.762 | 0.069 | 0.071 | 0.903 | 0.063 | 0.079 | 0.798 | 0.385 | 0.331 | 0.747 |
| Fast3R | 4.493 | 3.681 | 0.735 | 0.149 | 0.116 | 0.692 | 0.361 | 0.201 | 0.782 | 0.546 | 0.306 | 0.621 | 0.955 | 0.630 | 0.627 |
| MonST3R | 20.145 | 10.322 | 0.603 | 0.276 | 0.277 | 0.677 | 0.471 | 0.458 | 0.659 | 0.623 | 0.541 | 0.594 | 1.688 | 1.031 | 0.670 |
| DUSt3R/LSM | 1.284 | 1.349 | 0.720 | 0.022 | 0.029 | 0.709 | 0.035 | 0.024 | 0.838 | 0.026 | 0.022 | 0.784 | 0.620 | 0.474 | 0.718 |
| MASt3R | 1.374 | 1.409 | 0.723 | 0.025 | 0.028 | 0.697 | 0.043 | 0.042 | 0.809 | 0.035 | 0.020 | 0.757 | 0.209 | 0.211 | 0.708 |
| Spann3R | 6.505 | 3.110 | 0.668 | 0.176 | 0.087 | 0.599 | 0.343 | 0.073 | 0.661 | 0.262 | 0.118 | 0.606 | 0.635 | 0.930 | 0.662 |
| CUT3R | 4.710 | 2.413 | 0.699 | 0.025 | 0.028 | 0.665 | 0.076 | 0.029 | 0.782 | 0.042 | 0.030 | 0.693 | 0.740 | 0.595 | 0.665 |
| VGGT | 2.103 | 1.925 | 0.748 | 0.019 | 0.032 | 0.659 | 0.015 | 0.012 | 0.874 | 0.016 | 0.017 | 0.728 | 0.065 | 0.091 | 0.692 |
| Fast3R | 3.647 | 2.319 | 0.725 | 0.046 | 0.057 | 0.636 | 0.059 | 0.028 | 0.772 | 0.200 | 0.097 | 0.625 | 0.711 | 0.337 | 0.610 |
| MonST3R | 14.455 | 7.508 | 0.636 | 0.100 | 0.091 | 0.648 | 0.336 | 0.246 | 0.665 | 0.346 | 0.293 | 0.599 | 1.138 | 0.948 | 0.591 |
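
The ACC and Comp columns above are point-cloud distance metrics. As a minimal sketch of how such metrics are commonly computed, assuming the predicted and ground-truth clouds are already in the same coordinate frame and that distances are averaged with a plain mean (the benchmark's exact alignment and aggregation protocol is not stated on this page), accuracy measures how far predicted points lie from the ground truth and completeness measures the reverse. The function names are illustrative, not taken from the benchmark code.

```python
import numpy as np

def nearest_distances(src, dst):
    """For each point in src (N, 3), return the distance to its nearest point in dst (M, 3)."""
    diff = src[:, None, :] - dst[None, :, :]          # (N, M, 3) pairwise offsets
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def accuracy_completeness(pred, gt):
    """Accuracy: mean pred-to-gt distance. Completeness: mean gt-to-pred distance."""
    acc = float(nearest_distances(pred, gt).mean())
    comp = float(nearest_distances(gt, pred).mean())
    return acc, comp

# Toy example: the prediction covers only one of two ground-truth points.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pred = np.array([[0.0, 0.0, 0.0]])
acc, comp = accuracy_completeness(pred, gt)
print(acc, comp)
```

In the toy example the single predicted point is exact, so accuracy is perfect, while completeness is penalized for the missing point. The brute-force pairwise distance is quadratic in memory; a real evaluation on dense clouds would use a spatial index such as a KD-tree.
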

| Method | CO3Dv2 ATE ↓ | CO3Dv2 RPE trans ↓ | CO3Dv2 RPE rot ↓ | ScanNet & ADT & TUM-Dyn. ATE ↓ | ScanNet & ADT & TUM-Dyn. RPE trans ↓ | ScanNet & ADT & TUM-Dyn. RPE rot ↓ | KITTI Odometry ATE ↓ | KITTI Odometry RPE trans ↓ | KITTI Odometry RPE rot ↓ | Bonn & Sintel & Rel10k ATE ↓ | Bonn & Sintel & Rel10k RPE trans ↓ | Bonn & Sintel & Rel10k RPE rot ↓ | ACID & Syndrone ATE ↓ | ACID & Syndrone RPE trans ↓ | ACID & Syndrone RPE rot ↓ | ULTRRA RPE trans ↓ | ULTRRA RPE rot ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DUSt3R/LSM | 0.903 | 1.325 | 4.312 | 0.139 | 0.102 | 2.394 | 2.935 | 1.135 | 2.832 | 0.077 | 0.557 | 1.657 | 0.126 | 0.379 | 2.836 | 70.350 | 70.390 |
| MASt3R | 0.987 | 1.407 | 3.999 | 0.131 | 0.098 | 2.889 | 1.492 | 0.399 | 0.407 | 0.058 | 0.559 | 1.305 | 0.130 | 0.376 | 2.601 | 71.519 | 78.036 |
| Spann3R | 0.915 | 1.295 | 6.352 | 0.294 | 0.164 | 3.778 | 15.848 | 5.031 | 4.645 | 0.083 | 0.102 | 1.297 | 0.117 | 0.149 | 1.484 | 40.503 | 38.366 |
| CUT3R | 0.847 | 1.209 | 6.361 | 0.185 | 0.133 | 4.471 | 2.421 | 0.747 | 0.669 | 0.033 | 0.039 | 0.500 | 0.071 | 0.090 | 0.914 | 55.135 | 54.395 |
| VGGT | 0.478 | 0.704 | 2.264 | 0.113 | 0.086 | 1.535 | 0.955 | 0.315 | 0.335 | 0.062 | 0.111 | 0.580 | 0.280 | 0.461 | 0.802 | 63.451 | 77.281 |
| Fast3R | 0.698 | 1.035 | 4.352 | 0.499 | 0.391 | 23.739 | 22.109 | 7.573 | 7.366 | 0.111 | 0.170 | 2.017 | 0.436 | 0.518 | 1.979 | 51.149 | 54.150 |
| MonST3R | 2.456 | 3.327 | 23.458 | 0.448 | 0.286 | 12.817 | 2.426 | 0.782 | 0.949 | 0.098 | 0.152 | 0.830 | 0.335 | 0.504 | 1.514 | 70.388 | 77.325 |
| Align3R | 1.027 | 1.550 | 6.499 | 0.425 | 0.215 | 9.430 | 4.611 | 0.817 | 0.600 | 0.076 | 0.091 | 1.083 | 0.150 | 0.179 | 0.977 | 72.010 | 70.638 |
| Easi3R | 0.857 | 1.271 | 5.052 | 0.174 | 0.103 | 2.872 | 3.625 | 0.919 | 0.615 | 0.075 | 0.094 | 1.361 | 0.119 | 0.138 | 1.733 | 62.061 | 71.060 |
| Geo4D | 0.798 | 1.264 | 5.692 | 0.436 | 0.175 | 10.565 | 1.662 | 0.497 | 0.696 | 0.573 | 0.472 | 3.779 | 0.384 | 0.329 | 1.395 | - | - |
| Aether | 3.168 | 2.366 | 21.643 | 0.644 | 0.273 | 14.804 | 1.553 | 0.744 | 0.744 | 0.195 | 0.122 | 1.610 | 0.152 | 0.097 | 0.796 | - | - |
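
ATE and RPE in the table above are standard trajectory metrics. The sketch below shows one common formulation, assuming camera-to-world poses given as 4x4 matrices and assuming the predicted trajectory has already been aligned to the ground truth (the alignment step, and the benchmark's exact settings, are not described on this page). The helper `translation_pose` exists only to build the toy example.

```python
import numpy as np

def ate_rmse(pred_positions, gt_positions):
    """Absolute trajectory error: RMSE of per-frame position error (trajectories pre-aligned)."""
    sq_err = ((pred_positions - gt_positions) ** 2).sum(axis=1)
    return float(np.sqrt(sq_err.mean()))

def relative_pose_errors(pred_poses, gt_poses):
    """Mean translational and rotational (degrees) error of consecutive-frame relative motion."""
    t_errs, r_errs = [], []
    for i in range(len(gt_poses) - 1):
        rel_gt = np.linalg.inv(gt_poses[i]) @ gt_poses[i + 1]
        rel_pred = np.linalg.inv(pred_poses[i]) @ pred_poses[i + 1]
        err = np.linalg.inv(rel_gt) @ rel_pred        # residual motion between the two
        t_errs.append(np.linalg.norm(err[:3, 3]))
        cos = np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
        r_errs.append(np.degrees(np.arccos(cos)))
    return float(np.mean(t_errs)), float(np.mean(r_errs))

def translation_pose(x):
    """Illustrative helper: identity rotation, translation x along the first axis."""
    pose = np.eye(4)
    pose[0, 3] = x
    return pose

# Toy example: the prediction overshoots the second frame by 0.1 along x.
gt_poses = np.stack([translation_pose(0.0), translation_pose(1.0)])
pred_poses = np.stack([translation_pose(0.0), translation_pose(1.1)])
ate = ate_rmse(pred_poses[:, :3, 3], gt_poses[:, :3, 3])
rpe_trans, rpe_rot = relative_pose_errors(pred_poses, gt_poses)
print(ate, rpe_trans, rpe_rot)
```

ATE captures global drift, while RPE isolates frame-to-frame error, which is why a method can rank differently on the two.
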

| Method | DTU AbsRel ↓ | DTU δ<1.03 ↑ | ScanNet AbsRel ↓ | ScanNet δ<1.03 ↑ | KITTI AbsRel ↓ | KITTI δ<1.03 ↑ | ETH3D AbsRel ↓ | ETH3D δ<1.03 ↑ | T&T AbsRel ↓ | T&T δ<1.03 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Robust MVD | 2.490 | 80.056 | 7.468 | 35.651 | 9.419 | 30.505 | 9.302 | 42.909 | 6.379 | 58.409 |
| DUSt3R/LSM | 2.741 | 75.685 | 4.732 | 61.337 | 9.113 | 39.495 | 3.132 | 74.851 | 3.106 | 77.033 |
| MASt3R | 3.343 | 68.301 | 5.949 | 54.516 | 9.542 | 46.805 | 2.471 | 81.291 | 2.381 | 82.262 |
| Spann3R | 6.431 | 38.339 | 7.779 | 33.713 | 10.195 | 30.858 | 5.121 | 54.708 | 5.580 | 52.812 |
| CUT3R | 6.200 | 47.421 | 8.231 | 39.464 | 23.849 | 12.087 | 5.224 | 59.864 | 4.594 | 56.773 |
| VGGT | 1.085 | 94.305 | 4.386 | 64.968 | 9.436 | 41.309 | 1.782 | 86.337 | 2.075 | 85.174 |
| Fast3R | 3.940 | 62.120 | 6.271 | 50.283 | 13.390 | 26.734 | 4.692 | 62.663 | 4.423 | 64.873 |
| MonST3R | 5.346 | 67.977 | 5.557 | 53.309 | 10.191 | 40.274 | 3.368 | 72.624 | 3.289 | 72.491 |
| Robust MVD | 2.242 | 84.574 | 8.016 | 35.924 | 10.846 | 25.534 | 10.944 | 35.526 | 6.982 | 60.643 |
| MASt3R | 84.904 | 0.000 | 93.584 | 0.000 | 99.069 | 0.000 | 97.021 | 0.000 | 98.234 | 0.000 |
| CUT3R | 84.904 | 0.000 | 93.584 | 0.000 | 99.069 | 0.000 | 97.022 | 0.000 | 98.234 | 0.000 |
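
AbsRel and the δ inlier ratio in the table above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's evaluation code: invalid ground-truth pixels are assumed to be marked with zero depth, and the optional median scaling is one common way to compare scale-ambiguous predictions (with `align_median=False` the prediction is scored at its raw, metric scale). The threshold is a parameter, since the depth tables on this page use both 1.03 and 1.25.

```python
import numpy as np

def depth_metrics(pred, gt, threshold=1.03, align_median=True):
    """Return (AbsRel, inlier ratio) over pixels with valid ground truth."""
    valid = gt > 0                                    # assumption: 0 marks missing depth
    pred, gt = pred[valid], gt[valid]
    if align_median:
        # Rescale the prediction so its median matches the ground-truth median.
        pred = pred * (np.median(gt) / np.median(pred))
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    ratio = np.maximum(pred / gt, gt / pred)
    inlier = np.mean(ratio < threshold)
    return float(abs_rel), float(inlier)

# Toy example: the prediction is correct up to a global factor of 2.
gt = np.array([1.0, 2.0, 4.0, 0.0])
pred = np.array([2.0, 4.0, 8.0, 5.0])
aligned = depth_metrics(pred, gt, align_median=True)
raw = depth_metrics(pred, gt, align_median=False)
print(aligned, raw)
```

The toy example shows why the two settings can diverge sharply: a prediction with perfect relative structure but the wrong global scale scores perfectly after alignment and fails every pixel without it.
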

| Method | Bonn AbsRel ↓ | Bonn δ<1.25 ↑ | TUM Dyn AbsRel ↓ | TUM Dyn δ<1.25 ↑ | KITTI AbsRel ↓ | KITTI δ<1.25 ↑ | PointOdyssey AbsRel ↓ | PointOdyssey δ<1.25 ↑ | Syndrone AbsRel ↓ | Syndrone δ<1.25 ↑ | Sintel AbsRel ↓ | Sintel δ<1.25 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DepthAnyVideo | 0.515 | 25.3 | 0.184 | 84.6 | 0.074 | 95.3 | 0.417 | 61.7 | 0.299 | 83.1 | 0.455 | 47.9 |
| VideoDepthAnything | 0.268 | 48.3 | 1.101 | 89.0 | 0.060 | 98.2 | 0.283 | 70.3 | 0.138 | 92.5 | 1.691 | 45.4 |
| DepthCrafter | 0.107 | 88.3 | 0.159 | 79.5 | 0.120 | 86.2 | 0.144 | 81.3 | 0.380 | 87.5 | 0.354 | 58.2 |
| Marigold | 0.329 | 52.2 | 0.600 | 32.8 | 0.332 | 43.3 | 0.346 | 47.5 | 1.331 | 16.8 | 0.417 | 45.4 |
| DUSt3R/LSM | 0.174 | 83.5 | 0.187 | 79.2 | 0.124 | 84.9 | 0.168 | 77.8 | 0.063 | 96.9 | 0.475 | 59.1 |
| MASt3R | 0.160 | 81.5 | 0.162 | 83.1 | 0.082 | 93.2 | 0.150 | 79.3 | 0.046 | 97.5 | 0.374 | 63.9 |
| Spann3R | 0.205 | 77.4 | 0.204 | 70.6 | 0.449 | 49.1 | 0.303 | 58.4 | 0.241 | 74.5 | 0.587 | 43.3 |
| CUT3R | 0.068 | 95.0 | 0.108 | 84.7 | 0.104 | 89.9 | 0.095 | 88.4 | 0.111 | 89.5 | 0.466 | 56.0 |
| VGGT | 0.056 | 96.3 | 0.068 | 93.9 | 0.051 | 96.6 | 0.026 | 99.0 | 0.075 | 95.9 | 0.242 | 65.9 |
| Fast3R | 0.232 | 69.4 | 0.221 | 71.1 | 0.308 | 46.8 | 0.271 | 66.2 | 0.368 | 44.8 | 0.565 | 48.7 |
| MonST3R | 0.061 | 95.4 | 0.197 | 72.6 | 0.083 | 93.4 | 0.066 | 92.3 | 0.110 | 89.7 | 0.343 | 59.4 |
| Align3R | 0.062 | 96.8 | 0.107 | 90.1 | 0.105 | 89.2 | 0.077 | 93.3 | 0.097 | 92.9 | 0.237 | 69.0 |
| Easi3R | 0.061 | 95.8 | 0.192 | 76.9 | 0.150 | 76.2 | 0.143 | 82.1 | 0.095 | 94.0 | 0.323 | 53.9 |
| Geo4D | 0.060 | 97.8 | 0.096 | 93.2 | 0.086 | 93.8 | 0.082 | 93.0 | 0.105 | 93.1 | 0.205 | 73.2 |
| Aether | 0.582 | 61.2 | 0.192 | 80.6 | 0.065 | 96.2 | 0.123 | 87.9 | 0.145 | 91.1 | 0.343 | 69.4 |
| GeometryCrafter | 0.061 | 96.8 | 0.115 | 87.7 | 0.410 | 53.8 | 0.124 | 83.6 | 0.123 | 90.8 | 0.280 | 72.4 |
| MASt3R | 0.549 | 4.6 | 0.633 | 0.9 | 0.754 | 6.4 | 0.749 | 0.2 | 0.967 | 0 | 0.701 | 2.3 |
| CUT3R | 0.097 | 90.3 | 0.135 | 80.6 | 0.118 | 87.4 | 0.127 | 88.1 | 0.824 | 0 | 1.020 | 23.6 |

| Method | DTU PSNR ↑ | DTU SSIM ↑ | DTU LPIPS ↓ | RealEstate10k PSNR ↑ | RealEstate10k SSIM ↑ | RealEstate10k LPIPS ↓ | ScanNet++ PSNR ↑ | ScanNet++ SSIM ↑ | ScanNet++ LPIPS ↓ | ACID PSNR ↑ | ACID SSIM ↑ | ACID LPIPS ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LSM | 17.38 | 0.6274 | 0.3198 | 18.92 | 0.6677 | 0.3643 | 17.12 | 0.6860 | 0.3887 | 20.46 | 0.6160 | 0.3822 |
| NoPoSplat | 17.91 | 0.6306 | 0.2810 | 24.53 | 0.8450 | 0.1634 | 22.15 | 0.7988 | 0.2359 | 25.35 | 0.7774 | 0.1875 |
| FLARE | 17.01 | 0.5672 | 0.2901 | 22.15 | 0.7126 | 0.2363 | 23.19 | 0.8117 | 0.2201 | 22.44 | 0.6229 | 0.2818 |
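
Of the three image-quality metrics in the table above, PSNR is simple enough to sketch directly. The version below assumes images stored as floating-point arrays in the range [0, 1]; the value range and any cropping or masking used by the benchmark are not stated on this page. SSIM and LPIPS require a windowed statistic and a learned network respectively and are omitted.

```python
import numpy as np

def psnr(rendered, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = rendered.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")                           # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Toy example: a uniform error of 0.1 on every pixel.
target = np.zeros((4, 4, 3))
rendered = np.full((4, 4, 3), 0.1)
value = psnr(rendered, target)
print(value)
```

Because PSNR is logarithmic in the mean squared error, a gap of a few dB between methods corresponds to a large difference in pixel error.
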
Inference Efficiency
| Method | Time ↓ (2 views) | GPU ↓ (2 views) | Time ↓ (4 views) | GPU ↓ (4 views) | Time ↓ (8 views) | GPU ↓ (8 views) | Time ↓ (16 views) | GPU ↓ (16 views) | Time ↓ (32 views) | GPU ↓ (32 views) | Time ↓ (64 views) | GPU ↓ (64 views) | Time ↓ (128 views) | GPU ↓ (128 views) | Time ↓ (256 views) | GPU ↓ (256 views) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DUSt3R | 0.35 ± 0.19 | 2.49 | 6.00 ± 0.30 | 2.6 | 13.96 ± 0.86 | 3.65 | 50.37 ± 2.28 | 8.38 | 196.81 ± 6.38 | 27.52 | OOM | OOM | OOM | OOM | OOM | OOM |
| MASt3R | 9.43 ± 0.28 | 2.61 | 14.63 ± 0.52 | 2.68 | 21.38 ± 2.26 | 2.78 | 42.28 ± 9.06 | 3.35 | 117.77 ± 40.83 | 6.87 | 392.23 ± 184.36 | 28.78 | OOM | OOM | OOM | OOM |
| Spann3R | 0.16 ± 0.12 | 2.79 | 0.28 ± 0.01 | 2.8 | 0.65 ± 0.00 | 2.81 | 1.38 ± 0.01 | 2.84 | 2.81 ± 0.07 | 2.89 | 5.51 ± 0.03 | 2.99 | 11.25 ± 0.16 | 3.19 | 23.64 ± 0.70 | 3.55 |
| CUT3R | 0.19 ± 0.07 | 3.33 | 0.26 ± 0.04 | 3.38 | 0.42 ± 0.03 | 3.48 | 0.78 ± 0.03 | 3.65 | 1.50 ± 0.03 | 4.28 | 3.12 ± 0.31 | 5.54 | 5.76 ± 0.12 | 11.68 | 11.65 ± 0.16 | 17.36 |
| VGGT | 0.32 ± 0.41 | 7.11 | 0.29 ± 0.40 | 7.72 | 0.24 ± 0.01 | 9.06 | 0.72 ± 0.49 | 10.29 | 2.35 ± 0.04 | 12.75 | 4.23 ± 0.07 | 17.66 | 11.76 ± 0.41 | 28.65 | 34.21 ± 2.51 | 50.92 |
| Fast3R | 0.13 ± 0.14 | 4.05 | 0.11 ± 0.03 | 4.26 | 0.15 ± 0.02 | 4.75 | 0.30 ± 0.01 | 5.8 | 0.69 ± 0.02 | 7.25 | 1.78 ± 0.03 | 8.43 | 5.13 ± 0.06 | 10.91 | 16.55 ± 0.12 | 15.75 |
| MonST3R | 0.32 ± 0.25 | 2.79 | 14.78 ± 0.52 | 4.8 | 18.77 ± 0.20 | 7.84 | 35.76 ± 0.35 | 8.9 | 73.19 ± 0.37 | 16.15 | 148.17 ± 0.99 | 32.99 | 605.83 ± 25.24 | 66.66 | OOM | OOM |
| Easi3R | 0.35 ± 0.19 | 2.49 | 17.35 ± 1.10 | 3.41 | 24.18 ± 0.76 | 4.15 | 60.12 ± 2.67 | 7.69 | 137.16 ± 10.86 | 15.96 | 273.78 ± 2.08 | 32.53 | 901.05 ± 5.29 | 65.68 | OOM | OOM |
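
The "mean ± standard deviation" timing entries above can be produced with a small harness like the one sketched here. It is a generic illustration, not the benchmark's measurement code: `dummy_model` is a hypothetical stand-in for a real model call, and the repeat and warm-up counts are arbitrary. Measuring GPU memory, and synchronizing the device before reading the clock, would need framework-specific calls that are left out.

```python
import statistics
import time

def time_inference(fn, num_views, repeats=5, warmup=1):
    """Run fn(num_views) several times and return (mean, std) of wall-clock seconds."""
    for _ in range(warmup):
        fn(num_views)                                 # discard cold-start runs
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(num_views)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

def dummy_model(num_views):
    """Hypothetical stand-in whose cost grows quadratically, like all-pairs processing."""
    return sum(i * j for i in range(num_views) for j in range(num_views))

results = {n: time_inference(dummy_model, n) for n in (2, 4, 8)}
for n, (mean, std) in results.items():
    print(f"{n} views: {mean:.6f} ± {std:.6f} s")
```

Sweeping the view count in powers of two, as in the table, makes the scaling behavior of each method easy to read off.
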
Findings and Takeaways
What Is the Impact of Tasks with Different Difficulties?
- Multi-view geometry inference is inherently harder than pair-view inference.
- Directly predicting dense 3D scene representations is much more challenging than estimating individual 3D attributes like depth and camera poses.
- Metric-scale depth estimation remains a key challenge for GFMs.
- Joint prediction of multiple geometric attributes (e.g., pose, depth, matching) may underlie recent performance gains.
Takeaway 1: Current GFMs are promising but face significant challenges when learning from overly complex tasks. Recommendation: Carefully decomposing difficult tasks (e.g., jointly predicting geometry, pose, depth, and tracking) into simpler sub-problems can facilitate more effective learning, especially under limited 3D data.
Do GFMs Generalize Well on Different Data Domains?
- GFMs struggle to generalize in domains with extreme data scarcity.
Takeaway 2: Diverse, high-quality data is critical for strong generalization. To improve robustness in underrepresented domains, GFMs must be trained on data that covers broader distributions and includes metric-scale annotations.
Hints for Model Architecture Design: ViT or Diffusion? Strong 2D Feature Extractor?
- No single design, feed-forward ViT or diffusion, is universally superior.
- Stronger 2D foundation models can significantly enhance 3D GFMs.
Takeaway 3: No single backbone, whether feed-forward ViT or diffusion, dominates; architecture choice should align with task needs. Moreover, leveraging strong 2D feature extractors (e.g., DINO) substantially boosts 3D performance.
Are Current GFMs Ready for Real-Time Perception Systems?
- Despite progress, GFMs still lack the efficiency required for real-time 3D applications.
Takeaway 4: As GFMs scale to handle more views and complex tasks, efficiency becomes as critical as accuracy for enabling real-time 3D perception.
Citation
@article{cong2025e3dbench,
title={E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models},
author={Cong, Wenyan and Liang, Yiqing and Zhang, Yancheng and Yang, Ziyi and Wang, Yan and Ivanovic, Boris and Pavone, Marco and Chen, Chen and Wang, Zhangyang and Fan, Zhiwen},
journal={arXiv preprint arXiv:2506.01933},
year={2025}
}