Fin3R
Fine-Tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation
NeurIPS 2025
Abstract
We present Fin3R, a simple, effective, and general fine-tuning method for feed-forward 3D reconstruction models. This family of models regresses pointmaps of all input images into a reference-frame coordinate system, along with other auxiliary outputs, in a single forward pass. However, we find that current models struggle with fine geometry and robustness due to (i) the scarcity of high-fidelity depth and pose supervision and (ii) the inherent geometric misalignment from multi-view pointmap regression. Fin3R tackles both issues jointly with an extra lightweight fine-tuning step. We freeze the decoder, which handles view matching, and fine-tune only the image encoder, the component dedicated to feature extraction. The encoder is enriched with fine geometric details distilled from a strong monocular teacher model on large, unlabeled datasets via a custom, lightweight LoRA adapter. We validate our method on a wide range of models, including DUSt3R, MASt3R, CUT3R, and VGGT. The fine-tuned models consistently deliver sharper boundaries, recover complex structures, and achieve higher geometric accuracy in both single- and multi-view settings, while adding only tiny LoRA weights that leave test-time memory and latency virtually unchanged.
Method
Fin3R fine-tunes the encoder of feed-forward reconstruction models with a custom LoRA adapter. Purple dashed lines indicate distillation supervision on the canonical view (depth or pointmap); green dashed lines denote multi-view pointmap supervision. Note that during fine-tuning the decoder is frozen and only the LoRA weights are updated.
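The sketch below illustrates the idea in PyTorch. It is a minimal, illustrative outline, not the authors' code: the `LoRALinear` wrapper, `add_lora_to_attention`, `scale_shift_invariant_loss`, and `finetune_step` names, as well as the `model.encoder` / `model.decoder` interfaces and the choice of a scale-and-shift-invariant depth loss, are assumptions. The multi-view pointmap supervision term (green lines in the figure) is omitted for brevity.

```python
# Minimal sketch of encoder-only LoRA fine-tuning with monocular distillation.
# Names and interfaces are illustrative; the paper's actual adapter and loss may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank residual (LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # original encoder weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as an identity residual
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))


def add_lora_to_attention(encoder: nn.Module, rank: int = 8):
    """Wrap attention-related linears in the ViT encoder with LoRA.
    Which sub-modules to adapt is a design choice; here we match linears whose
    names suggest qkv or output projections."""
    for module in encoder.modules():
        for child_name, child in list(module.named_children()):
            if isinstance(child, nn.Linear) and any(k in child_name for k in ("qkv", "proj")):
                setattr(module, child_name, LoRALinear(child, rank=rank))


def scale_shift_invariant_loss(pred_depth, teacher_depth, mask):
    """Align student depth to the relative teacher depth with a per-batch
    least-squares scale and shift, then take the L1 error."""
    p = pred_depth[mask].flatten()
    t = teacher_depth[mask].flatten()
    A = torch.stack([p, torch.ones_like(p)], dim=1)
    sol = torch.linalg.lstsq(A, t.unsqueeze(1)).solution.squeeze(1)
    return F.l1_loss(sol[0] * p + sol[1], t)


def finetune_step(model, teacher, images, optimizer):
    """One distillation step: only the LoRA parameters receive gradients;
    the decoder and the base encoder weights are frozen."""
    with torch.no_grad():
        teacher_depth = teacher(images)      # (B, H, W) relative depth from the teacher

    feats = model.encoder(images)            # LoRA-adapted features
    pointmap = model.decoder(feats)          # (B, H, W, 3) points in the reference frame
    pred_depth = pointmap[..., 2]            # canonical-view depth = z component

    mask = torch.isfinite(teacher_depth)
    loss = scale_shift_invariant_loss(pred_depth, teacher_depth, mask)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the optimizer would be built only over parameters with `requires_grad=True` (the LoRA `down`/`up` weights), e.g. `torch.optim.AdamW(p for p in model.parameters() if p.requires_grad)`, so the decoder and original encoder remain untouched at test time.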
Monocular Depth Estimation
Our fine-tuning method consistently improves the monocular depth estimation quality of various feed-forward 3D reconstruction models, covering both two-view and multi-view models as well as relative-depth and metric-depth models.
Multi-view Performance
Our fine-tuning method consistently improves the pose accuracy of various feed-forward 3D reconstruction models, even without pose supervision during training. This suggests that the decoder functions as an implicit feature matcher, leveraging the improved encoder features to enhance performance without requiring explicit pose labels.
Qualitative Comparison
Our fine-tuning method improves the fine details and robustness of the baseline methods.
2D Depth Estimation Results
BibTeX
@inproceedings{ren2025fin3r,
title={Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation},
author={Ren, Weining and Wang, Hongjun and Tan, Xiao and Han, Kai},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025}
}