Approach
Steer3D adapts ControlNet to 3D generation, injecting text steerability into pretrained image-to-3D models. As shown below, given an image (e.g. of a crab), existing image-to-3D models can generate a 3D crab that looks like the image. Steer3D lets the user edit the 3D crab with language, such as "replacing its legs with sleek robotic limbs colored silver". The edited crab aligns with the editing text while remaining consistent with the original crab. Steer3D trains on 100k-scale synthetic data generated by our automated data engine, which combines existing image-to-3D models and vision-language models to produce editing pairs that are diverse, consistent, and correct. Together, this scalable data engine and our data-efficient architecture design yield a strong editing model.
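For concreteness, here is a minimal PyTorch sketch of this ControlNet-style injection: a frozen block of a pretrained image-to-3D model is paired with a trainable copy that consumes the text embedding, and the copy's zero-initialized output is added back as a residual. All module and tensor names here are illustrative, not the actual interfaces from the paper.

import copy
import torch
import torch.nn as nn

class BaseBlock(nn.Module):
    # Stand-in for one frozen block of a pretrained image-to-3D model.
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class ControlBranch(nn.Module):
    # Trainable copy of the base block that consumes the text embedding.
    # The output projection is zero-initialized, so at step 0 the branch is
    # a no-op and the pretrained shape/geometry prior is preserved exactly.
    def __init__(self, base_block, dim, text_dim):
        super().__init__()
        self.block = copy.deepcopy(base_block)
        for p in self.block.parameters():
            p.requires_grad_(True)  # the copy trains even though the base is frozen
        self.text_in = nn.Linear(text_dim, dim)
        self.zero_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.zero_out.weight)
        nn.init.zeros_(self.zero_out.bias)

    def forward(self, x, text_emb):
        return self.zero_out(self.block(x + self.text_in(text_emb)))

dim, text_dim = 64, 32
base = BaseBlock(dim)
for p in base.parameters():
    p.requires_grad_(False)  # the pretrained model stays frozen

control = ControlBranch(base, dim, text_dim)
tokens = torch.randn(2, 16, dim)    # latent 3D tokens from the base model
text = torch.randn(2, 1, text_dim)  # embedding of the editing instruction
out = base(tokens) + control(tokens, text)  # residual text injection

Because the added branch starts as an exact no-op, training can focus on learning the edit signal rather than re-learning 3D generation, which is what makes this kind of design data-efficient.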
Architecture and Recipe
To enable data-efficient training, we design a ControlNet-based architecture that leverages the shape and geometry priors of pretrained image-to-3D models. The architecture is shown below. We train it with a two-stage recipe, flow-matching training followed by Direct Preference Optimization (DPO), to avoid the trivial local minimum of "no edit". More details can be found in the paper!
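As a rough sketch of the two stages, assuming the rectified-flow form of flow matching and a Diffusion-DPO-style preference loss in which per-sample flow-matching losses serve as negative log-likelihood surrogates (the paper's exact objectives may differ, and all names are illustrative):

import copy
import torch
import torch.nn.functional as F

def fm_loss(model, x0, x1, cond, t=None):
    # Per-sample flow-matching loss (rectified-flow form): the model
    # regresses the straight-path velocity (x1 - x0) at x_t.
    if t is None:
        t = torch.rand(x0.shape[0], 1, device=x0.device)
    xt = (1 - t) * x0 + t * x1
    v_pred = model(xt, t, cond)
    return ((v_pred - (x1 - x0)) ** 2).flatten(1).mean(dim=1)

def dpo_loss(model, ref_model, x0, x_win, x_lose, cond, beta=0.1):
    # Preference pairs: the preferred target x_win performs the edit, the
    # dispreferred target x_lose leaves the shape unchanged ("no edit").
    t = torch.rand(x0.shape[0], 1, device=x0.device)  # share t across terms
    lw = fm_loss(model, x0, x_win, cond, t)
    ll = fm_loss(model, x0, x_lose, cond, t)
    with torch.no_grad():  # frozen reference keeps the policy near its init
        lw_ref = fm_loss(ref_model, x0, x_win, cond, t)
        ll_ref = fm_loss(ref_model, x0, x_lose, cond, t)
    # Lower flow-matching loss ~ higher likelihood, hence (ref - policy).
    margin = (lw_ref - lw) - (ll_ref - ll)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage with a linear velocity model (purely illustrative).
dim = 8
net = torch.nn.Linear(dim, dim)
ref_net = copy.deepcopy(net)  # reference = frozen copy of the stage-1 policy
model = lambda xt, t, cond: net(xt)
ref = lambda xt, t, cond: ref_net(xt)
x0, x_win, x_lose = torch.randn(4, dim), torch.randn(4, dim), torch.randn(4, dim)
loss = dpo_loss(model, ref, x0, x_win, x_lose, cond=None)

Here the dispreferred sample is the unchanged shape, so the preference term directly penalizes the "no edit" shortcut that plain flow matching can fall into.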
Data Engine
We build a data engine that generates synthetic data and applies a two-stage filter, yielding diverse, consistent, and correct editing pairs as training data. Check out the paper for the scaling analysis that backs up the importance of this scalable data strategy!
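A minimal sketch of the filtering loop, assuming the two stages check the consistency and correctness properties named above; the vlm.judge_* calls are hypothetical stand-ins for VLM judgments on rendered views, not the data engine's actual interface:

from dataclasses import dataclass

@dataclass
class EditPair:
    source_asset: str   # the original 3D asset (or its rendered views)
    edited_asset: str   # a candidate edited 3D asset
    instruction: str    # the natural-language editing instruction

def filter_pairs(candidates, vlm, target_size=100_000):
    # Keep only candidate pairs that survive both filter stages.
    kept = []
    for pair in candidates:
        # Stage 1: consistency filter, unedited regions must be preserved.
        if not vlm.judge_consistency(pair.source_asset, pair.edited_asset):
            continue
        # Stage 2: correctness filter, the change must match the instruction.
        if not vlm.judge_correctness(pair.source_asset, pair.edited_asset,
                                     pair.instruction):
            continue
        kept.append(pair)
        if len(kept) >= target_size:
            break
    return kept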
BibTeX
@misc{ma2025feedforward3deditingtextsteerable,
  title={Feedforward 3D Editing via Text-Steerable Image-to-3D},
  author={Ziqi Ma and Hongqiao Chen and Yisong Yue and Georgia Gkioxari},
  year={2025},
  eprint={2512.13678},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.13678},
}