Results
Comparison against Deterministic Methods
This figure compares DiffHuman against PHORHUM and S3F.
PHORHUM outputs excellent front predictions, but exhibits over-smooth, flat geometry and blurry colours on the back.
S3F yields more detailed geometry, but colours are still often blurry.
Moreover, both of these methods occasionally paste the front colour predictions onto the back incorrectly (see row 3).
Samples from DiffHuman achieve a greater level of geometric detail and colour sharpness in uncertain regions.
Here we compare DiffHuman against PIFuHD, ICON and ECON.
These deterministic methods often fall back towards the mean of the training data distribution when faced with ambiguous and challenging inputs; e.g. predicting trousers from the back instead of a long skirt in row 3.
This can be mitigated by predicting distributions over reconstructions instead, thus modelling the inherent ambiguity in this task.
Denoising Trajectory and Diversity Visualisation
This visualises the denoising trajectory for a single sample, showing the noisy observation set \(\boldsymbol{x}_t\) and clean prediction \(\boldsymbol{x}^{(t)}_{0_\Theta}\) at each timestep. \(\boldsymbol{x}^{(t)}_{0_\Theta}\) is initially simple (see the back normals); geometric and colour details develop over the reverse process.
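As a rough illustration, snapshots like these can be gathered by recording the clean prediction at every step of the reverse process. The sketch below assumes a DDIM-style sampler and a network that predicts the clean observation directly; `denoiser` and `alphas_cumprod` are placeholder names, not the paper's actual implementation.

```python
import torch

# Minimal sketch: log (x_t, x_0 prediction) at every reverse step.
# Assumes a DDIM-style (deterministic) sampler and an x_0-predicting network.
@torch.no_grad()
def reverse_process_with_snapshots(denoiser, x_T, alphas_cumprod):
    x_t = x_T
    snapshots = []                                   # (x_t, clean prediction) per timestep
    for t in reversed(range(len(alphas_cumprod))):
        a_bar = alphas_cumprod[t]
        x0_pred = denoiser(x_t, t)                   # clean prediction at timestep t
        snapshots.append((x_t.clone(), x0_pred.clone()))
        if t > 0:
            # Recover the implied noise, then take a deterministic DDIM step to t-1.
            eps = (x_t - a_bar.sqrt() * x0_pred) / (1 - a_bar).sqrt()
            a_prev = alphas_cumprod[t - 1]
            x_t = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return snapshots
```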
We can show the emergence of sample diversity over the reverse process using the per-pixel variance of the observations in \(\boldsymbol{x}^{(t)}_{0_\Theta}\) at each timestep. Blue indicates a lower variance and red indicates a higher variance, computed over 10 samples. \(\boldsymbol{x}^{(t)}_{0_\Theta}\) becomes more diverse over time as the samples diverge. The back surface is, intuitively, more uncertain than the front.
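In spirit, the diversity heat-maps amount to stacking the clean predictions from several independent runs and taking the per-pixel variance. The sketch below is illustrative only; the array shapes and the blue-to-red colour map are assumptions, not the paper's code.

```python
import numpy as np
import matplotlib.pyplot as plt

def per_pixel_variance_map(samples: np.ndarray) -> np.ndarray:
    """Return an (H, W) variance map across samples, averaged over channels.

    `samples` is assumed to have shape (N, H, W, C), e.g. N predicted back-normal
    or albedo images from N independent runs of the reverse process.
    """
    var = samples.var(axis=0)          # (H, W, C): variance over the sample dimension
    return var.mean(axis=-1)           # collapse channels to a single heat-map value

# Example with 10 random stand-in samples (the figure also uses 10 samples):
samples = np.random.rand(10, 256, 256, 3)
heatmap = per_pixel_variance_map(samples)
plt.imshow(heatmap, cmap="coolwarm")   # blue = low variance, red = high variance
plt.colorbar()
plt.savefig("diversity_heatmap.png")
```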
Unconditional and Edge-Conditioned Samples
We can obtain unconditional samples from DiffHuman by training a model with random condition dropping. These samples are generated from random noise only, which is masked using a silhouette in the shape of the desired subject. However, faces and other details within the silhouette may be blurry, due to a lack of conditioning signal.
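A minimal sketch of the two ideas above, under stated assumptions: random condition dropping during training, and silhouette-masked noise at sampling time. `denoiser`, `cond_image`, `p_drop` and `silhouette` are illustrative names, not the paper's API.

```python
import torch

def training_forward(denoiser, x_t, t, cond_image, p_drop=0.1):
    # With probability p_drop, replace the conditioning image with a null (zero)
    # condition, so a single network learns both conditional and unconditional denoising.
    if torch.rand(()) < p_drop:
        cond_image = torch.zeros_like(cond_image)
    return denoiser(x_t, t, cond_image)

def unconditional_initial_noise(silhouette, channels=3):
    # silhouette: (H, W) binary mask of the desired subject; noise outside it is zeroed.
    h, w = silhouette.shape
    noise = torch.randn(channels, h, w)
    return noise * silhouette.unsqueeze(0)
```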
Conditioning on edge-maps enables finer control than masked random noise, e.g. over in-silhouette details such as facial features and clothing boundaries.
These samples are generated using a DiffHuman model that was pre-trained with conditioning RGB images and then fine-tuned using conditioning edge maps.
This demonstrates that samples from DiffHuman can be controlled via simpler conditioning inputs than full RGB images, which opens the possibility for generative applications beyond reconstruction.
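For illustration only, an edge-map conditioning input could be prepared from an RGB image with an off-the-shelf detector such as Canny; the specific extractor, thresholds and file paths below are assumptions, as the page only states that edge maps are used as conditioning.

```python
import cv2

# Hypothetical example: derive a binary edge map from an RGB image to use as conditioning.
rgb = cv2.imread("person.png")                     # placeholder input path
gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)                  # binary edge map, same H x W as the input
cv2.imwrite("person_edges.png", edges)
```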
BibTeX
@inproceedings{sengupta2024diffhuman,
  author    = {Sengupta, Akash and Alldieck, Thiemo and Kolotouros, Nikos and Corona, Enric and Zanfir, Andrei and Sminchisescu, Cristian},
  title     = {{DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans}},
  booktitle = {CVPR},
  month     = {June},
  year      = {2024}
}