TL;DR: Our model can create realistic 4D avatars using any number of reference images.
Overview
Our model works in two stages: First, a morphable multi-view diffusion model (MMDM) generates a large set of images spanning different views and expressions, conditioned on the reference images. Then, we fit a 4D avatar to the generated images and the reference images. This avatar can be controlled via a 3DMM and rendered in real time.
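The sketch below outlines this two-stage data flow in Python. The names (tracker, mmdm, fitter) are hypothetical placeholders for the face tracker, the MMDM, and the Gaussian splatting fitting stage, not the released code.

def build_avatar(reference_images, target_views, target_expressions,
                 tracker, mmdm, fitter):
    # Stage 1: track a 3DMM for each reference image, then use the morphable
    # multi-view diffusion model to generate additional views and expressions
    # consistent with the references.
    ref_3dmms = [tracker(image) for image in reference_images]
    gen_images, gen_3dmms = mmdm.generate(
        reference_images, ref_3dmms,
        views=target_views, expressions=target_expressions,
    )

    # Stage 2: fit a deformable 3D Gaussian splatting avatar to both the
    # reference and the generated images; the result is rigged to the 3DMM
    # and can be rendered in real time.
    return fitter(images=reference_images + gen_images,
                  tracked_3dmms=ref_3dmms + gen_3dmms)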
Method
Overview of CAP4D. (a) The method takes as input an arbitrary number of reference images \(\mathbf{I}_\text{ref}\) that are encoded into the latent space of a variational autoencoder. An off-the-shelf face tracker (FlowFace) estimates a 3DMM (FLAME), \(\mathbf{M}_\text{ref}\), for each reference image, from which we derive conditioning signals that describe camera view direction, \(\mathbf{V}_\text{ref}\), head pose \(\mathbf{P}_\text{ref}\), and expression \(\mathbf{E}_\text{ref}\). We associate additional conditioning signals with each input noisy latent image based on the desired generated viewpoints, poses, and expressions. The MMDM generates images through a stochastic input–output conditioning procedure that randomly samples reference images and generated images during each step of the iterative image generation process. (b) The generated and reference images are used with the tracked and sampled 3DMMs to reconstruct a 4D avatar based on a deformable 3D Gaussian splatting representation (GaussianAvatars).
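As a rough illustration of the stochastic input–output conditioning, the sketch below assumes the MMDM processes a fixed-size window of latents per step and randomly splits that window between clean reference latents (conditioning only) and noisy generated latents (to be denoised); denoise_fn and the window size handling are illustrative assumptions, not the released interface.

import random

def stochastic_io_step(ref_latents, gen_latents, timestep, denoise_fn, window=8):
    # Randomly choose how many reference latents to include in this window.
    n_ref = random.randint(1, min(len(ref_latents), window - 1))
    ref_idx = random.sample(range(len(ref_latents)), n_ref)
    gen_idx = random.sample(range(len(gen_latents)),
                            min(window - n_ref, len(gen_latents)))

    refs = [ref_latents[i] for i in ref_idx]   # clean, used as conditioning
    gens = [gen_latents[i] for i in gen_idx]   # noisy, to be denoised

    # Denoise the sampled subset and write the results back, so different
    # subsets of generated images see different references at each step.
    for i, latent in zip(gen_idx, denoise_fn(refs, gens, timestep)):
        gen_latents[i] = latent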
Morphable Multi-view Diffusion Model (MMDM)
MMDM architecture. Our model is initialized from Stable Diffusion 2.1, and we adapt the architecture for multi-view generation following CAT3D. A pre-trained image encoder maps the input images into the latent space, and the latent diffusion model processes eight images in parallel. We replace the 2D attention layers after the 2D residual blocks with 3D attention to share information between frames. The model is conditioned on images that encode head pose (\(\mathbf{P}_\text{ref/gen}\)), expression (\(\mathbf{E}_\text{ref/gen}\)), and camera view (\(\mathbf{V}_\text{ref/gen}\)). These conditioning images are obtained from a 3DMM and concatenated with the latent images. The denoised latent image is decoded using a pre-trained decoder.
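The snippet below sketches the 2D-to-3D attention swap: instead of attending over the H×W tokens of each frame independently, tokens from all eight frames attend to each other. The module and the channel/spatial sizes are stand-ins for illustration, not the actual Stable Diffusion 2.1 layers.

import torch
import torch.nn as nn

class Attention3D(nn.Module):
    def __init__(self, channels=320, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, latents):                      # latents: (B, F, C, H, W)
        b, f, c, h, w = latents.shape
        # Flatten all frames into one token sequence so attention is shared
        # across views and expressions, not just within a single image.
        tokens = latents.permute(0, 1, 3, 4, 2).reshape(b, f * h * w, c)
        tokens, _ = self.attn(tokens, tokens, tokens)
        return tokens.reshape(b, f, h, w, c).permute(0, 1, 4, 2, 3)

x = torch.randn(1, 8, 320, 16, 16)                   # eight frames in parallel
print(Attention3D()(x).shape)                        # (1, 8, 320, 16, 16)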
Gallery
We show various avatars generated using CAP4D in different settings: avatars from a few reference images, avatars from a single reference image, and more challenging settings such as avatars from images generated via text prompts and avatars from artwork. Note that while the MMDM inherits weights from Stable Diffusion, we do not train it on non-photoreal images.
Please click on the arrow buttons to the sides to view all results.
Reference images are shown in the top row, images generated by the MMDM in the middle row, and the final 4D avatar in the bottom row.
Few images to avatar
Single image to avatar
Text to image to avatar
Artwork to avatar
Baseline Comparisons
We conduct experiments on the cross-reenactment and self-reenactment tasks. For quantitative results, we refer to our paper.
Self-reenactment results. We show more qualitative results from our self-reenactment evaluation with varying numbers of reference frames. The top row shows single-image, the second row few-image (10), and the last row many-image (100) reconstructions. Our 4D avatar can leverage additional reference images to produce details that are not visible in the first reference image. Our results are significantly better than those of previous methods, especially when the view direction differs greatly from the reference image.
Cross-reenactment results. We generate an avatar based on a single image from the FFHQ dataset. The camera orbits around the head to allow a better assessment of 3D structure. Our method consistently produces 4D avatars of higher visual quality and 3D consistency even across challenging view deviations. Our avatar can also model realistic view-dependent lighting changes.
More Results
Effect of reference image quantity
CAP4D generates realistic avatars from a single reference image. The model can leverage additional available reference images and recover details and geometry that are not visible in the first view, resulting in an overall improved reconstruction of the reference identity. We provide a side-by-side comparison of avatars created from a single image, a few images (4), and many images (64) below. The differences are subtle; notice, however, the freckles and birthmarks that appear with more reference images.
single reference image
4 reference images
64 reference images
images generated with MMDM
CAP4D avatar
Editing of appearance and lighting
We can edit our avatars by applying off-the-shelf image editing models to the reference image. Here, we demonstrate appearance editing (Stable-Makeup) and relighting (IC-Light).
original and edited avatars
4D animation from audio
The generated avatar is controlled via the FLAME 3DMM; hence, we can leverage off-the-shelf speech-driven animation models such as CodeTalker to animate it from input audio.
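The sketch below illustrates the idea, assuming a speech-driven model that predicts per-frame FLAME expression and jaw parameters from a waveform; speech_model.predict, avatar.render, and the parameter names are assumptions rather than the released interface.

def animate_from_audio(avatar, speech_model, waveform, camera):
    frames = []
    for expression, jaw_pose in speech_model.predict(waveform):
        # The avatar is rigged to FLAME, so any valid FLAME parameter
        # sequence (from tracking, audio, or manual editing) can drive it.
        frames.append(avatar.render(expression=expression,
                                    jaw_pose=jaw_pose,
                                    camera=camera))
    return frames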
BibTeX
@inproceedings{taubner2025cap4d,
author = {Taubner, Felix and Zhang, Ruihang and Tuli, Mathieu and Lindell, David B.},
title = {{CAP4D}: Creating Animatable {4D} Portrait Avatars with Morphable Multi-View Diffusion Models},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2025},
pages = {5318--5330}
}