We present TRAvatar, a novel framework for capturing and reconstructing high-fidelity volumetric avatars. Trained efficiently end-to-end on multi-view image sequences captured under varying illuminations, TRAvatar produces virtual avatars that can be relit and animated in real time with high fidelity.
Abstract
In this paper, we propose a novel framework, Tracking-free Relightable Avatar (TRAvatar), for capturing and reconstructing high-fidelity 3D avatars. Compared to previous methods, TRAvatar works in a more practical and efficient setting. Specifically, TRAvatar is trained with dynamic image sequences captured in a Light Stage under varying lighting conditions, enabling realistic relighting and real-time animation for avatars in diverse scenes. Additionally, TRAvatar allows for tracking-free avatar capture and obviates the need for accurate surface tracking under varying illumination conditions. Our contributions are two-fold: First, we propose a novel network architecture that explicitly builds on and ensures the satisfaction of the linear nature of lighting. Trained on simple group light captures, TRAvatar can predict the appearance in real time with a single forward pass, achieving high-quality relighting effects under illuminations of arbitrary environment maps. Second, we jointly optimize the facial geometry and relightable appearance from scratch based on image sequences, where the tracking is implicitly learned. This tracking-free approach improves robustness in establishing temporal correspondences between frames under different lighting conditions. Extensive qualitative and quantitative experiments demonstrate that our framework achieves superior performance for photorealistic avatar animation and relighting.
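To illustrate the linearity of lighting exploited above, the following minimal NumPy sketch shows how appearances predicted under individual group lights can be linearly combined into the appearance under an arbitrary environment map. All names here (decode_basis_appearance, GROUP_LIGHTS, APPEARANCE_DIM) are illustrative assumptions, not the released implementation.

# Minimal sketch (not the authors' code) of how linearity of light transport
# lets per-group-light predictions be combined for an arbitrary environment map.
import numpy as np

GROUP_LIGHTS = 32                # assumed number of grouped Light Stage sources
APPEARANCE_DIM = (256, 256, 3)   # e.g. an RGB appearance map (placeholder size)

def decode_basis_appearance(expression_code, view_dir):
    """Stand-in for a network predicting appearance under each group light."""
    rng = np.random.default_rng(0)
    return rng.random((GROUP_LIGHTS, *APPEARANCE_DIM))  # placeholder output

def relight(expression_code, view_dir, env_weights):
    """Relight by linearly combining the basis appearances.

    env_weights: (GROUP_LIGHTS,) intensities obtained by projecting the target
    environment map onto the group-light basis (assumed to be done offline).
    """
    basis = decode_basis_appearance(expression_code, view_dir)  # (L, H, W, 3)
    return np.tensordot(env_weights, basis, axes=1)             # (H, W, 3)

# Usage: uniform white lighting as a trivial environment map.
img = relight(expression_code=None, view_dir=None,
              env_weights=np.ones(GROUP_LIGHTS) / GROUP_LIGHTS)
print(img.shape)  # (256, 256, 3)

Because the combination is a single weighted sum, relighting to a new environment map requires only one forward pass of the basis predictor, which is what enables real-time performance.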
The pipeline of our framework. TRAvatar is a relightable volumetric avatar representation learned from multi-view image sequences spanning dynamic expressions and varying illuminations. For each frame, a motion encoder predicts a disentangled global rigid transformation and an expression code. Given the expression code, lighting condition, and view direction, a series of decoders then predict the base mesh and the volumetric primitives mounted on it. Notably, a physically-inspired appearance decoder is proposed to facilitate network training. Finally, the avatar representation is assembled and rendered, supporting arbitrary viewpoints and lighting conditions.
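The sketch below schematically mirrors the forward pass described in the caption; it is not the authors' released code, and every module name and tensor shape is an illustrative placeholder.

# Schematic of the per-frame forward pass (illustrative placeholders only).
import numpy as np

def forward_pass(frame_images, lighting, view_dir,
                 motion_encoder, mesh_decoder, primitive_decoder, appearance_decoder):
    # 1. Motion encoder: disentangled global rigid transform and expression code.
    rigid_transform, expr_code = motion_encoder(frame_images)
    # 2. Base mesh driven by the expression code.
    base_mesh = mesh_decoder(expr_code)
    # 3. Volumetric primitives mounted on the base mesh.
    primitives = primitive_decoder(expr_code, base_mesh)
    # 4. Physically-inspired appearance decoder, conditioned on expression,
    #    lighting, and view direction (linear in the lighting; see the sketch above).
    appearance = appearance_decoder(expr_code, lighting, view_dir)
    # A renderer would consume these outputs together with camera parameters.
    return rigid_transform, base_mesh, primitives, appearance

# Minimal smoke test with dummy stand-ins for the learned modules.
out = forward_pass(
    frame_images=np.zeros((4, 3, 512, 512)),          # multi-view frame (placeholder)
    lighting=np.ones(32) / 32,                        # group-light weights
    view_dir=np.array([0.0, 0.0, 1.0]),
    motion_encoder=lambda x: (np.eye(4), np.zeros(256)),
    mesh_decoder=lambda z: np.zeros((1000, 3)),       # vertex positions
    primitive_decoder=lambda z, m: np.zeros((128, 8, 8, 8, 4)),   # RGBA volumes
    appearance_decoder=lambda z, l, v: np.zeros((128, 8, 8, 8, 3)),
)
print([getattr(o, "shape", None) for o in out])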
Video
Acknowledgments
We would like to thank Qianfang Zou and Xuesong Niu for being our capture subjects. We would also like to thank our colleagues Guoxin Zhang, Liqian Ma, Yanpei Cao, and Xiubao Jiang for their contributions to the development of the capture apparatus.
BibTeX
@inproceedings{yang2023towards,
  title={Towards Practical Capture of High-Fidelity Relightable Avatars},
  author={Yang, Haotian and Zheng, Mingwu and Feng, Wanquan and Huang, Haibin and Lai, Yu-Kun and Wan, Pengfei and Wang, Zhongyuan and Ma, Chongyang},
  booktitle={SIGGRAPH Asia 2023 Conference Proceedings},
  year={2023}
}