GAS: Generative Avatar Synthesis from a Single Image
ICCV 2025
¹Carnegie Mellon University ²Shanghai AI Laboratory ³Stanford University
Abstract
We present a unified and generalizable framework for synthesizing view-consistent and temporally coherent avatars from a single image, addressing the challenging task of single-image avatar generation. Existing diffusion-based methods often condition on sparse human templates (e.g., depth or normal maps), which leads to multi-view and temporal inconsistencies due to the mismatch between these signals and the true appearance of the subject. Our approach bridges this gap by combining the reconstruction power of regression-based 3D human reconstruction with the generative capabilities of a diffusion model. First, an initial 3D human reconstructed by a generalizable NeRF provides dense conditioning that keeps the synthesis faithful to the reference appearance and structure. The geometry and appearance derived from this NeRF then serve as input to a video-based diffusion model. This integration is pivotal for enforcing both multi-view and temporal consistency throughout avatar generation. Empirical results underscore the strong generalization of our method, demonstrating its effectiveness on both in-domain and out-of-domain in-the-wild datasets.
Method
Starting from a single input image, GAS uses a generalizable human NeRF to map the subject into a canonical space, then reposes and renders the 3D NeRF model to extract detailed appearance cues (i.e., NeRF renderings). These are paired with geometry cues (i.e., SMPL normal maps) and fed into a video diffusion model. A switcher module disentangles the tasks, enabling the model to generate either multi-view consistent novel views or temporally coherent pose animations.
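To make the data flow concrete, here is a minimal Python sketch of the pipeline. It is an illustration under stated assumptions, not the released implementation: reconstruct_canonical_nerf, repose_and_render, video_diffusion, and the Task flag are hypothetical placeholders standing in for the generalizable human NeRF, the reposing/rendering stage, and the switcher-conditioned video diffusion model.

from dataclasses import dataclass
from enum import Enum

import numpy as np


class Task(Enum):
    VIEW = 0  # switcher branch for multi-view synthesis
    POSE = 1  # switcher branch for pose animation


@dataclass
class ConditioningCues:
    appearance: np.ndarray  # reposed NeRF renderings, shape (T, H, W, 3)
    geometry: np.ndarray    # SMPL normal maps, shape (T, H, W, 3)


def reconstruct_canonical_nerf(image: np.ndarray) -> dict:
    """Placeholder for the generalizable human NeRF that maps the
    subject in `image` into a canonical 3D representation."""
    return {"canonical": image}  # stand-in for an actual radiance field


def repose_and_render(nerf: dict, smpl_params: list,
                      H: int = 256, W: int = 256) -> ConditioningCues:
    """Placeholder: repose the canonical NeRF with SMPL parameters and
    render appearance cues plus SMPL normal maps per target frame."""
    T = len(smpl_params)
    appearance = np.zeros((T, H, W, 3), dtype=np.float32)
    geometry = np.zeros((T, H, W, 3), dtype=np.float32)
    return ConditioningCues(appearance, geometry)


def video_diffusion(cues: ConditioningCues, task: Task) -> np.ndarray:
    """Placeholder for the conditioned video diffusion model; `task`
    plays the role of the switcher that disentangles the two modes."""
    return cues.appearance  # stand-in for the denoised output frames


# End-to-end flow for one target pose sequence of 8 frames.
image = np.zeros((256, 256, 3), dtype=np.float32)   # the single input image
nerf = reconstruct_canonical_nerf(image)
cues = repose_and_render(nerf, smpl_params=[None] * 8)
frames = video_diffusion(cues, Task.POSE)           # (8, 256, 256, 3)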
Applications
Interactive view and pose synthesis
Leveraging the unified framework, we enable interactive synthesis of human avatars: users can synthesize novel views on the fly while animating the subject through novel poses.
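A minimal sketch of what such an interactive session could look like, assuming a hypothetical generate_frame call whose Mode argument stands in for the switcher; none of these names come from the released code.

from enum import Enum

import numpy as np


class Mode(Enum):
    VIEW = "view"  # switcher set to view synthesis: orbit the camera
    POSE = "pose"  # switcher set to pose animation: advance the motion


def generate_frame(prev: np.ndarray, mode: Mode, control) -> np.ndarray:
    """Placeholder for one sampling call of the diffusion model; `mode`
    sets the switcher, `control` carries either a camera azimuth in
    degrees (VIEW) or an index into a pose sequence (POSE)."""
    return prev  # stand-in for the generated frame


frame = np.zeros((256, 256, 3), dtype=np.float32)  # begins as the input image
# A user session interleaving both tasks: orbit, animate two steps, orbit back.
for mode, control in [(Mode.VIEW, 30.0), (Mode.POSE, 1),
                      (Mode.POSE, 2), (Mode.VIEW, -15.0)]:
    frame = generate_frame(frame, mode, control)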
Synchronized Multi-view Video Generation
By alternating sampling between the view and pose synthesis modes, we can generate synchronized multi-view videos of human performers from only a single image.
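The alternation can be sketched as follows. This is an assumed reading of the sampling schedule, with pose_sampling and view_sampling as hypothetical stand-ins for the diffusion model's two switcher modes: pose mode first animates the reference view through time, then view mode expands each timestep into the remaining views.

import numpy as np

V, T, H, W = 4, 8, 256, 256  # views x timesteps (illustrative sizes)


def pose_sampling(anchor: np.ndarray, n_frames: int) -> np.ndarray:
    """Placeholder for the diffusion model in pose mode: animates one
    view (H, W, 3) into a temporally coherent clip (n_frames, H, W, 3)."""
    return np.repeat(anchor[None], n_frames, axis=0)


def view_sampling(anchor: np.ndarray, n_views: int) -> np.ndarray:
    """Placeholder for the diffusion model in view mode: expands one
    frame (H, W, 3) into consistent renderings (n_views, H, W, 3)."""
    return np.repeat(anchor[None], n_views, axis=0)


reference = np.zeros((H, W, 3), dtype=np.float32)  # the single input image

# Step 1, pose mode: animate the reference view through time.
clip = pose_sampling(reference, T)                       # (T, H, W, 3)

# Step 2, view mode: for each timestep, synthesize the remaining views,
# anchored on the frame just produced, giving a synchronized grid.
grid = np.stack([view_sampling(clip[t], V) for t in range(T)])
print(grid.shape)  # (T, V, H, W, 3): synchronized multi-view video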
Results
Novel view synthesis
We demonstrate the capability of our method to synthesize view-consistent avatars from a single image.
Novel pose animation
We demonstrate the capability of our method to synthesize temporally coherent avatars with realistic deformations from a single image.
Comparison with baselines
We compare our method with baselines on the tasks of novel view synthesis and novel pose animation.
BibTeX
@article{lu2025gas,
  title={GAS: Generative Avatar Synthesis from a Single Image},
  author={Lu, Yixing and Dong, Junting and Kwon, Youngjoong and Zhao, Qin and Dai, Bo and De la Torre, Fernando},
  journal={arXiv preprint arXiv:2502.06957},
  year={2025}
}