Rethinking Diffusion for Text-Driven Human Motion Generation
Abstract
Since 2023, Vector Quantization (VQ)-based discrete generation methods have rapidly come to dominate human motion generation, largely surpassing diffusion-based continuous generation methods on standard performance metrics. However, VQ-based methods have inherent limitations. Representing continuous motion data with a limited set of discrete tokens leads to inevitable information loss, reduces the diversity of generated motions, and restricts their ability to function effectively as motion priors or generation guidance. In contrast, the continuous-space nature of diffusion-based generation makes it well suited to address these limitations, with additional potential for model scalability. In this work, we systematically investigate why current VQ-based methods perform well and examine the limitations of existing diffusion-based methods from the perspective of motion data representation and distribution. Drawing on these insights, we preserve the inherent strengths of a diffusion-based human motion generation model and gradually optimize it with inspiration from VQ-based approaches. Our approach introduces a human motion diffusion model that performs bidirectional masked autoregression, optimized with a reformed data representation and distribution. Additionally, we propose more robust evaluation methods to fairly assess methods built on different generation paradigms. Extensive experiments on benchmark human motion generation datasets demonstrate that our method outperforms previous methods and achieves state-of-the-art performance.
Architecture Overview
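The abstract above describes bidirectional masked autoregression with a continuous diffusion head. Below is a minimal, illustrative sketch of that sampling pattern only; the module names (ContextEncoder, DiffusionHead), dimensions, autoregressive schedule, and toy denoising update are all assumptions for exposition and do not correspond to the paper's implementation.

# Illustrative sketch: bidirectional masked autoregressive sampling with a
# continuous diffusion head. All names, sizes, and schedules are hypothetical.
import torch
import torch.nn as nn

D_MOTION, D_MODEL, N_DIFFUSION_STEPS = 263, 256, 50

class ContextEncoder(nn.Module):
    """Bidirectional transformer over partially masked motion frames plus a text embedding."""
    def __init__(self):
        super().__init__()
        self.in_proj = nn.Linear(D_MOTION, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, motion, mask, text_emb):
        x = self.in_proj(motion)
        x = torch.where(mask.unsqueeze(-1), torch.zeros_like(x), x)  # hide still-unknown frames
        return self.encoder(x + text_emb.unsqueeze(1))               # bidirectional attention

class DiffusionHead(nn.Module):
    """Predicts a clean motion frame from a noisy frame, its context feature, and the timestep."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D_MOTION + D_MODEL + 1, D_MODEL),
                                 nn.SiLU(), nn.Linear(D_MODEL, D_MOTION))

    def forward(self, noisy, ctx, t):
        t_emb = t.float().expand(noisy.shape[0], 1) / N_DIFFUSION_STEPS
        return self.net(torch.cat([noisy, ctx, t_emb], dim=-1))

@torch.no_grad()
def generate(encoder, head, text_emb, n_frames, n_ar_steps=8):
    motion = torch.zeros(1, n_frames, D_MOTION)
    mask = torch.ones(1, n_frames, dtype=torch.bool)   # True = frame not yet generated
    for step in range(n_ar_steps):
        ctx = encoder(motion, mask, text_emb)
        # Pick a subset of the still-masked frames to fill in this autoregressive round.
        unknown = mask[0].nonzero(as_tuple=True)[0]
        n_pick = max(1, unknown.numel() // (n_ar_steps - step))
        picked = unknown[torch.randperm(unknown.numel())[:n_pick]]
        # Denoise the picked frames in continuous space (toy x0-prediction loop,
        # not a faithful DDPM schedule).
        x = torch.randn(picked.numel(), D_MOTION)
        for t in reversed(range(N_DIFFUSION_STEPS)):
            x0_hat = head(x, ctx[0, picked], torch.tensor(t))
            x = x0_hat + 0.1 * (t / N_DIFFUSION_STEPS) * torch.randn_like(x)
        motion[0, picked] = x
        mask[0, picked] = False
    return motion

if __name__ == "__main__":
    enc, head = ContextEncoder(), DiffusionHead()
    out = generate(enc, head, text_emb=torch.randn(1, D_MODEL), n_frames=60)
    print(out.shape)  # torch.Size([1, 60, 263])

The key point the sketch tries to convey is the split of roles: a bidirectional encoder attends over both known and unknown frames, while a small diffusion head generates the selected frames in continuous space rather than predicting discrete tokens.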
Results Gallery
(Our method generates high-quality 3D human motions that faithfully follow textual instructions)
A person waves with both arms above head.
A man slowly walking forward.
The toon is standing, swaying a bit, then raising their left wrist as to check the time on a watch.
A man walks forward before stumbling backwards and then continues walking forward.
The person fell down and is crawling away from someone.
The sim reaches to their left and right, grabbing an object and appearing to clean it.
The man takes 4 steps backwards.
She jumps up and down, kicking her heels in the air.
A person who is standing lifts his hands and claps them four times.
A person who is running, stops, bends over, and looks down while taking small steps, then resumes running.
A person walks slowly forward holding handrail with left hand.
The person kick his left foot up and both hands up in counterclockwise circle and stop.
A person steps to the left sideways.
A person is walking across a narrow beam.
A person does a drumming movement with both hands.
Comparison with Other Methods
(Our method generates motion that is more realistic and more accurately follows the fine details of the textual condition)
We visually compare our method with three SOTA baseline methods: T2M-GPT [Zhang et al. (2023)], ReMoDiffuse [Zhang et al. (2023)], and MoMask [Guo et al. (2024)]. For a fair comparison, we show ReMoDiffuse animations both from the raw generation and after their additional temporal-filtering smoothing post-process.
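For readers unfamiliar with this kind of post-process: the exact filter ReMoDiffuse applies is not specified here, but a common choice for smoothing generated motion is a Savitzky-Golay filter along the time axis. The snippet below is a generic illustration of that idea only, not ReMoDiffuse's code.

# Generic temporal smoothing of a (frames, features) motion array; assumes a
# Savitzky-Golay filter purely for illustration.
import numpy as np
from scipy.signal import savgol_filter

def smooth_motion(motion, window=9, polyorder=2):
    """Smooth a (T, D) motion sequence along time; feature channels are filtered independently."""
    window = min(window, motion.shape[0] - (motion.shape[0] + 1) % 2)  # keep window odd and <= T
    return savgol_filter(motion, window_length=window, polyorder=polyorder, axis=0)

if __name__ == "__main__":
    raw = np.random.randn(120, 263).cumsum(axis=0)   # noisy toy trajectory
    smoothed = smooth_motion(raw)
    print(raw.shape, smoothed.shape)                 # (120, 263) (120, 263)

Such filtering removes frame-to-frame jitter but can also soften sharp, intentional movements, which is why we report ReMoDiffuse results both with and without it.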
A man steps forward, swings his leg, and turns all the way around.
ReMoDiffuse with Temporal Filter Postprocess
Ours
ReMoDiffuse
T2M-GPT
MoMask
A person doing a forward kick with each leg.
ReMoDiffuse with Temporal Filter Postprocess
Ours
ReMoDiffuse
T2M-GPT
MoMask
A man walks forward and then trips towards the right.
ReMoDiffuse with Temporal Filter Postprocess
Ours
ReMoDiffuse
T2M-GPT
MoMask
A person walks forward, stepping up with their right leg and down with their left, then turns to their left and walks, then turns to their left and starts stepping up.
ReMoDiffuse with Temporal Filter Postprocess
Ours
ReMoDiffuse
T2M-GPT
MoMask
A person fastly swimming forward.
ReMoDiffuse with Temporal Filter Postprocess
Ours
ReMoDiffuse
T2M-GPT
MoMask
Temporal Editing
Our method extends beyond standard text-to-motion generation to temporal editing. Here we present temporal editing results (prefix, in-between, suffix) produced with our method. The input motion clips are shown in grayscale and the edited content in full color.
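One way to think about these three editing modes is as masked generation: the known frames are kept fixed as context and only the masked span is generated under the new text prompt. The helper below is a hypothetical sketch of how such an edit mask could be built; the function name, span convention, and mode strings are assumptions, not part of our released code.

# Illustrative sketch of an edit mask for prefix / in-between / suffix editing.
import torch

def make_edit_mask(n_frames, mode, span):
    """Return a bool mask of shape (n_frames,) where True marks frames to (re)generate."""
    mask = torch.zeros(n_frames, dtype=torch.bool)
    if mode == "prefix":        # regenerate the first `span` frames
        mask[:span] = True
    elif mode == "in-between":  # regenerate a window in the middle
        start = (n_frames - span) // 2
        mask[start:start + span] = True
    elif mode == "suffix":      # regenerate the last `span` frames
        mask[-span:] = True
    return mask

if __name__ == "__main__":
    m = make_edit_mask(196, "in-between", span=60)
    print(m.sum().item(), m.shape)  # 60 torch.Size([196])

A mask like this would plug directly into a masked generation loop of the kind sketched under Architecture Overview, with the unmasked frames supplied as fixed bidirectional context.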
Prefix
Original: "A person walks in a circular counterclockwise direction one time before returning back to his/her original position."
+ Prefix: "A person dances around."
Original: "The man takes 4 steps backwards."
+ Prefix: "The man waves both hands."
In-Between
Original: "The person fell down and is crawling away from someone."
+ In-Between: "The person jumps up and down."
Original: "A person walks ina curved line."
+ In-Between: "The person takes a small jumps."
Suffix
Original: "A person is walking across a narrow beam."
+ Suffix: "A person raises his hands."
Original: "A man rises from the ground, walks in a circle and
- sits back down on the ground."
+ Suffix: "A man starts to run."
Generation Diversity
(Our method can generate diverse motions while maintaining high quality)
The person was pushed but did not fall.
A person walks around.
A person jumps up and then lands.
Ablation Studies
Prompt: A person walks in a circular counterclockwise direction one time before returning back to his/her original position.
Full Method
The visualizations show that our full method generates more realistic motion and follows the finer details of the textual instruction.
W/o Autoregressive Modeling
Without autoregressive modeling, the generated motion fails to fully align with the textual instructions.
W/o Motion Representation Reformation
Without the motion representation reformation, the model produces shaky and inaccurate motion.
BibTeX
@article{meng2024rethinking,
  title={Rethinking Diffusion for Text-Driven Human Motion Generation},
  author={Meng, Zichong and Xie, Yiming and Peng, Xiaogang and Han, Zeyu and Jiang, Huaizu},
  journal={arXiv preprint arXiv:2411.16575},
  year={2024}
}