M&M VTO: Multi-Garment Virtual Try-On and Editing
Luyang Zhu1,2, Yingwei Li1, Nan Liu1, Hao Peng1,
Dawei Yang1, Ira Kemelmacher-Shlizerman1,2
1Google Research 2University of Washington
CVPR 2024 (Highlight)
We present M&M VTO, a mix-and-match virtual try-on method that takes as input multiple garment images, a text description of the garment layout, and an image of a person. An example input includes an image of a shirt, an image of a pair of pants, the description "rolled sleeves, shirt tucked in", and an image of a person. The output is a visualization of how those garments, in the desired layout, would look on the given person. The key contributions of our method are: 1) a single-stage diffusion-based model, with no super-resolution cascading, that mixes and matches multiple garments at 1024x512 resolution while preserving and warping intricate garment details; 2) an architecture design (VTO UNet Diffusion Transformer) that disentangles denoising from person-specific features, enabling a highly effective fine-tuning strategy for identity preservation (a 6MB model per individual vs. 4GB achieved with, e.g., DreamBooth fine-tuning) and solving a common identity-loss problem in current virtual try-on methods; 3) layout control for multiple garments via text inputs, specifically fine-tuned over PaLI-3 for the virtual try-on task. Experimental results indicate that M&M VTO achieves state-of-the-art performance both qualitatively and quantitatively, and opens up new opportunities for language-guided and multi-garment virtual try-on.
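To make the interface concrete, the following is a minimal Python sketch of how the mix-and-match inputs and the single-stage sampling loop could be organized; all names here (MMVTOInputs, sample_try_on, model.denoise) are hypothetical illustrations, not the released implementation.

# Hypothetical sketch of the M&M VTO input format and single-stage sampling.
# MMVTOInputs, sample_try_on, and model.denoise are illustrative names only.
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class MMVTOInputs:
    person: np.ndarray                              # person image, e.g. (1024, 512, 3)
    upper_garment: Optional[np.ndarray] = None      # e.g. an image of a shirt
    lower_garment: Optional[np.ndarray] = None      # e.g. an image of a pair of pants
    full_body_garment: Optional[np.ndarray] = None  # e.g. an image of a dress
    layout_text: str = ""                           # e.g. "rolled sleeves, shirt tucked in"


def sample_try_on(model, inputs: MMVTOInputs, steps: int = 256) -> np.ndarray:
    # Single-stage sampling at 1024x512: every step is one conditioned denoising
    # pass of the same model; there is no separate super-resolution cascade.
    x = np.random.default_rng().standard_normal(inputs.person.shape)  # start from noise z_T
    for t in reversed(range(steps)):
        x = model.denoise(x, t, inputs)  # conditioned on person, garments, and layout text
    return x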
Approach
Overview of M&M VTO. Left: Given multiple garments (a top and a bottom in this case; a full-body garment is not shown for this example), a layout description, and a person image, our method enables multi-garment virtual try-on. Right: Freezing all model parameters, we optimize the person feature embeddings extracted from the person encoder to improve identity preservation for a specific input image. This fine-tuning recovers the information lost in the clothing-agnostic computation.
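The person fine-tuning on the right can be summarized as optimizing only the extracted person embeddings under a standard diffusion denoising loss, with every network weight frozen. Below is a hedged PyTorch sketch under that reading; vto_model.add_noise, vto_model.denoise, and the single-tensor person embedding are assumptions for illustration, and the actual losses and schedules may differ.

# Sketch: per-person fine-tuning of only the person feature embeddings.
# vto_model and person_encoder are assumed, duck-typed nn.Module-like objects.
import torch


def finetune_person_features(vto_model, person_encoder, person_image,
                             garments, layout_attrs, steps=500, lr=1e-3):
    # Freeze every network parameter; only the person embeddings are optimized,
    # which is why the per-person artifact stays small (a few MB, not GBs).
    for p in vto_model.parameters():
        p.requires_grad_(False)
    for p in person_encoder.parameters():
        p.requires_grad_(False)

    with torch.no_grad():
        person_feats = person_encoder(person_image)            # F_p from the frozen encoder
    person_feats = person_feats.clone().requires_grad_(True)   # the only trainable tensor

    opt = torch.optim.Adam([person_feats], lr=lr)
    for _ in range(steps):
        t = torch.randint(0, 1000, (1,))
        noise = torch.randn_like(person_image)
        z_t = vto_model.add_noise(person_image, noise, t)      # noised self-reconstruction target
        pred = vto_model.denoise(z_t, t, person_feats, garments, layout_attrs)
        loss = torch.nn.functional.mse_loss(pred, noise)       # standard epsilon-prediction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return person_feats  # stored per person and plugged back in at inference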
VTO-UDiT Architecture. For image inputs, UNet encoders ($\mathbf{E}_{\mathbf{z}_t}$, $\mathbf{E}_{p}$, $\mathbf{E}_{g}$) extract feature maps ($\mathcal{F}_{\mathbf{z}_t}$, $\mathcal{F}_{p}$, $\mathcal{F}_{g}^{\kappa}$) from $\mathbf{z}_t$, $I_{a}$, and $I_{c}^{\kappa}$, respectively, with $\kappa \in \{\text{upper}, \text{lower}, \text{full}\}$. The diffusion timestep $t$ and garment attributes $y_{\text{gl}}$ are embedded with sinusoidal positional encoding followed by a linear layer. The resulting embeddings ($\mathcal{F}_{t}$ and $\mathcal{F}_{y_{\text{gl}}}$) are then used to modulate features with FiLM or are concatenated to the key-value features of self-attention in the DiT blocks, similar to Imagen. Following TryOnDiffusion, the spatially aligned features ($\mathcal{F}_{\mathbf{z}_t}$, $\mathcal{F}_{p}$) are concatenated, whereas $\mathcal{F}_{g}^{\kappa}$ is implicitly warped with cross-attention blocks. The final denoised image $\hat{\mathbf{x}}_0$ is obtained with the decoder $\mathbf{D}_{\mathbf{z}_t}$, which is architecturally symmetric to $\mathbf{E}_{\mathbf{z}_t}$.
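The two conditioning mechanisms named in the caption, FiLM modulation by the timestep/attribute embeddings and cross-attention that implicitly warps garment features onto the person stream, can be sketched in a few lines of PyTorch. Layer sizes and module names below are illustrative assumptions, not the paper's exact blocks.

# Sketch of FiLM conditioning and garment cross-attention (illustrative sizes).
import torch
import torch.nn as nn


class FiLM(nn.Module):
    # Scales and shifts feature maps using an embedding such as F_t or F_ygl.
    def __init__(self, cond_dim, channels):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * channels)

    def forward(self, feats, cond):                    # feats: (B, C, H, W), cond: (B, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return feats * (1 + scale[..., None, None]) + shift[..., None, None]


class GarmentCrossAttention(nn.Module):
    # Queries come from the person/noise stream, keys and values from a garment
    # encoder, so garment details are "implicitly warped" onto the person features.
    def __init__(self, channels, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, person_feats, garment_feats):    # both (B, C, H, W)
        b, c, h, w = person_feats.shape
        q = person_feats.flatten(2).transpose(1, 2)    # (B, H*W, C)
        kv = garment_feats.flatten(2).transpose(1, 2)
        out, _ = self.attn(q, kv, kv)
        return person_feats + out.transpose(1, 2).reshape(b, c, h, w)


# Smoke test with toy shapes: modulate the concatenated F_zt/F_p stream, then
# attend into one garment stream F_g^upper.
film = FiLM(cond_dim=128, channels=64)
xattn = GarmentCrossAttention(channels=64)
person = torch.randn(1, 64, 32, 16)
garment = torch.randn(1, 64, 32, 16)
cond = torch.randn(1, 128)
out = xattn(film(person, cond), garment)               # (1, 64, 32, 16)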
Interactive Try-on Demo
Panels: Top Garment | Person or Try-on | Bottom Garment
Qualitative Results
Posed Garment VTO
Columns: Input Person, Input Garments, TryOnDiffusion, Ours.
Layflat Garment VTO
Columns: Input Person, Input Garments, TryOnDiffusion, GP-VTON, LaDI-VTON, Ours-DressCode, Ours.
Person Identity Preservation
Columns: Input Person, Input Garments, Fine-tuned Full Model, Fine-tuned Person Encoder, Ours Without Fine-tuning, Ours With Fine-tuning.
Garment Layout Editing
Instruction: Tuck in the shirt. Columns: Input Person, Input Garments, Imagen Editor, SDXL Inpainting, DiffEdit, InstructP2P, P2P + NI, Ours.
Instruction: Tuck out the shirt. Columns: Input Person, Input Garments, Imagen Editor, SDXL Inpainting, DiffEdit, InstructP2P, P2P + NI, Ours.
Instruction: Roll up the sleeve. Columns: Input Person, Input Garments, Imagen Editor, SDXL Inpainting, DiffEdit, InstructP2P, P2P + NI, Ours.
Instruction: Roll down the sleeve. Columns: Input Person, Input Garments, Imagen Editor, SDXL Inpainting, DiffEdit, InstructP2P, P2P + NI, Ours.
Dress VTO
Columns: Input Person, Input Garment, Try-on Result.
Limitations
M&M VTO has several limitations. First, our approach is not designed for layout editing tasks such as "Open the outer top", since the inputs provide no information about what should be inpainted in the newly exposed region. Second, our method struggles with garment combinations that are uncommon in the real world, such as a long coat paired with a skirt. Third, our model faces challenges when upper-body garments come from different images, e.g., pairing a shirt from one photo with an outer coat from another. This issue mainly stems from the difficulty of finding training pairs in which one image clearly shows a shirt without any cover while another shows the same shirt under an outer layer; as a result, the model struggles to accurately remove the shirt when it is covered by an outer layer at test time. Finally, note that our method visualizes how an item might look on a person, accounting for their body shape, but it does not yet incorporate size information or solve for exact fit.
BibTeX
@InProceedings{Zhu_2024_CVPR_mmvto,
  author    = {Zhu, Luyang and Li, Yingwei and Liu, Nan and Peng, Hao and Yang, Dawei and Kemelmacher-Shlizerman, Ira},
  title     = {M\&M VTO: Multi-Garment Virtual Try-On and Editing},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
}
Special Thanks
This work was done when all authors were at Google. We would like to thank Chris Lee, Andreas Lugmayr, Innfarn Yoo, Chunhui Gu, Alan Yang, Varsha Ramakrishnan, Tyler Zhu, Srivatsan Varadharajan, Yasamin Jafarian and Ricardo Martin-Brualla for their insightful discussions. We are grateful for the kind support of the whole Google ARML Commerce organization. We thank Aurelia Di for her professional assistance on the garment layering Q&A survey design.