TL;DR: Our model can create realistic 4D avatars using any number of reference images.
Overview
Our model works in two stages: First, a morphable multi-view diffusion model (MMDM) generates a large set of images spanning different views and expressions, conditioned on the reference images. Then, we fit a 4D avatar to the generated images and the reference images. This avatar can be controlled via a 3DMM and rendered in real time.
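The sketch below outlines this two-stage data flow in Python. The names (tracker, mmdm, fitter) are hypothetical placeholders for the face tracker, the MMDM, and the Gaussian splatting fitting stage, not the released code.

def build_avatar(reference_images, target_views, target_expressions,
                 tracker, mmdm, fitter):
    # Stage 1: track a 3DMM for each reference image, then use the morphable
    # multi-view diffusion model to generate additional views and expressions
    # consistent with the references.
    ref_3dmms = [tracker(image) for image in reference_images]
    gen_images, gen_3dmms = mmdm.generate(
        reference_images, ref_3dmms,
        views=target_views, expressions=target_expressions,
    )

    # Stage 2: fit a deformable 3D Gaussian splatting avatar to both the
    # reference and the generated images; the result is rigged to the 3DMM
    # and can be rendered in real time.
    return fitter(images=reference_images + gen_images,
                  tracked_3dmms=ref_3dmms + gen_3dmms)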
Method
Overview of CAP4D. (a) The method takes as input an arbitrary number of reference images \(\mathbf{I}_\text{ref}\) that are encoded into the latent space of a variational autoencoder. An off-the-shelf face tracker (FlowFace) estimates a 3DMM (FLAME), \(\mathbf{M}_\text{ref}\), for each reference image, from which we derive conditioning signals that describe camera view direction, \(\mathbf{V}_\text{ref}\), head pose \(\mathbf{P}_\text{ref}\), and expression \(\mathbf{E}_\text{ref}\). We associate additional conditioning signals with each input noisy latent image based on the desired generated viewpoints, poses, and expressions. The MMDM generates images through a stochastic input–output conditioning procedure that randomly samples reference images and generated images during each step of the iterative image generation process. (b) The generated and reference images are used with the tracked and sampled 3DMMs to reconstruct a 4D avatar based on a deformable 3D Gaussian splatting representation (GaussianAvatars).
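As a rough illustration of the stochastic input–output conditioning, the sketch below assumes the MMDM processes a fixed-size window of latents per step and randomly splits that window between clean reference latents (conditioning only) and noisy generated latents (to be denoised); denoise_fn and the window size handling are illustrative assumptions, not the released interface.

import random

def stochastic_io_step(ref_latents, gen_latents, timestep, denoise_fn, window=8):
    # Randomly choose how many reference latents to include in this window.
    n_ref = random.randint(1, min(len(ref_latents), window - 1))
    ref_idx = random.sample(range(len(ref_latents)), n_ref)
    gen_idx = random.sample(range(len(gen_latents)),
                            min(window - n_ref, len(gen_latents)))

    refs = [ref_latents[i] for i in ref_idx]   # clean, used as conditioning
    gens = [gen_latents[i] for i in gen_idx]   # noisy, to be denoised

    # Denoise the sampled subset and write the results back, so different
    # subsets of generated images see different references at each step.
    for i, latent in zip(gen_idx, denoise_fn(refs, gens, timestep)):
        gen_latents[i] = latent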
Morphable Multi-view Diffusion Model (MMDM)
MMDM architecture. Our model is initialized from Stable Diffusion 2.1, and we adapt the architecture for multi-view generation following CAT3D. A pre-trained image encoder maps the input images into the latent space, and the latent diffusion model processes eight images in parallel. We replace the 2D attention layers after the 2D residual blocks with 3D attention to share information between frames. The model is conditioned on images that encode head pose (\(\mathbf{P}_\text{ref/gen}\)), expression (\(\mathbf{E}_\text{ref/gen}\)), and camera view (\(\mathbf{V}_\text{ref/gen}\)). These conditioning images are obtained from a 3DMM and concatenated with the latent images. The denoised latent image is decoded using a pre-trained decoder.
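The snippet below sketches the 2D-to-3D attention swap: instead of attending over the H×W tokens of each frame independently, tokens from all eight frames attend to each other. The module and the channel/spatial sizes are stand-ins for illustration, not the actual Stable Diffusion 2.1 layers.

import torch
import torch.nn as nn

class Attention3D(nn.Module):
    def __init__(self, channels=320, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, latents):                      # latents: (B, F, C, H, W)
        b, f, c, h, w = latents.shape
        # Flatten all frames into one token sequence so attention is shared
        # across views and expressions, not just within a single image.
        tokens = latents.permute(0, 1, 3, 4, 2).reshape(b, f * h * w, c)
        tokens, _ = self.attn(tokens, tokens, tokens)
        return tokens.reshape(b, f, h, w, c).permute(0, 1, 4, 2, 3)

x = torch.randn(1, 8, 320, 16, 16)                   # eight frames in parallel
print(Attention3D()(x).shape)                        # (1, 8, 320, 16, 16)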
Gallery
We show various avatars generated using CAP4D in different settings: avatars from a few reference images, avatars from a single reference image, and more challenging settings such as avatars from images generated via text prompts and avatars from artwork. Note that while the MMDM inherits weights from Stable Diffusion, we do not train it on non-photoreal images.
Please click on the arrow buttons to the sides to view all results.
Reference images are shown in the top row, images generated by the MMDM in the middle row, and the final 4D avatar in the bottom row.
Few images to avatar
Single image to avatar
Text to image to avatar
Artwork to avatar
Baseline Comparisons
We conduct experiments on the cross-reenactment and self-reenactment tasks. For quantitative results, we refer to our paper.
Self-reenactment results. We show more qualitative results from our self-reenactment evaluation with varying numbers of reference frames. The top row shows single-image, the second row few-image (10), and the last row many-image (100) reconstructions. Our 4D avatar can leverage additional reference images to produce details that are not visible in the first reference image. Our results are significantly better than those of previous methods, especially when the view direction differs greatly from the reference image.
Cross-reenactment results. We generate an avatar based on a single image from the FFHQ dataset. The camera orbits around the head to allow a better assessment of 3D structure. Our method consistently produces 4D avatars of higher visual quality and 3D consistency even across challenging view deviations. Our avatar can also model realistic view-dependent lighting changes.
More Results
Effect of reference image quantity
CAP4D generates realistic avatars from a single reference image. The model can leverage additional available reference images and recover details and geometry that are not visible in the first view, resulting in an overall improved reconstruction of the reference identity. We provide a side-by-side comparison of avatars created from a single image, a few images (4), and many images (64) below. The differences are subtle; notice, however, the freckles and birthmarks that appear with more reference images.
single reference image
4 reference images
64 reference images
images generated with MMDM
CAP4D avatar
Editing of appearance and lighting
We can edit our avatars by applying off-the-shelf image editing models to the reference image. Here, we demonstrate appearance editing (Stable-Makeup) and relighting (IC-Light).
original and edited avatars
4D animation from audio
The generated avatar is controlled via the FLAME 3DMM; hence, we can leverage off-the-shelf speech-driven animation models such as CodeTalker to animate it from input audio.
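The sketch below illustrates the idea, assuming a speech-driven model that predicts per-frame FLAME expression and jaw parameters from a waveform; speech_model.predict, avatar.render, and the parameter names are assumptions rather than the released interface.

def animate_from_audio(avatar, speech_model, waveform, camera):
    frames = []
    for expression, jaw_pose in speech_model.predict(waveform):
        # The avatar is rigged to FLAME, so any valid FLAME parameter
        # sequence (from tracking, audio, or manual editing) can drive it.
        frames.append(avatar.render(expression=expression,
                                    jaw_pose=jaw_pose,
                                    camera=camera))
    return frames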
BibTeX
@inproceedings{taubner2025cap4d,
author = {Taubner, Felix and Zhang, Ruihang and Tuli, Mathieu and Lindell, David B.},
title = {{CAP4D}: Creating Animatable {4D} Portrait Avatars with Morphable Multi-View Diffusion Models},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2025},
pages = {5318--5330}
}