Gallery of SnapMoGen Dataset
MoMask++ Generation Results
"The person crouches low with knees bent and arms extended sideways like wings. They begin with small hops, gradually increasing height and breadth of their arm flaps. Their torso leans forward as they simulate taking off, rising onto the balls of their feet and stretching limbs outward. Movements are fluid and soaring, embodying the effort and grace of flight."
"The person crouches deeply with knees bent and arms hanging between the legs. They push off the ground into a high, forward-directed hop, landing in a squat with hands swinging upward. They repeat this frog-like jump several times in a rhythmic, bouncy motion, using their arms to assist each leap while maintaining a playful posture."
"The person takes a step and suddenly loses footing, sliding forward with legs split unevenly. Their arms flail outward for balance, body tilted backward. They wobble with quick foot adjustments, attempting to stabilize. After a near fall, they bend knees low and regain control, breathing out with relief and brushing off their clothes."
"With bent knees and raised heels, the person carefully tiptoes across an invisible floor. Each step is cautious and silent, arms out for balance. Occasionally, they freeze mid-step, listening for sounds, then continue with even more care. Their body leans slightly forward, head tilted to scan for obstacles as they proceed."
"The person lies flat on their belly and begins simulating swimming. Arms stretch forward and pull back in a breaststroke pattern while their legs kick alternately behind them. They lift their head occasionally to 'breathe,' and their whole body undulates in sync with each stroke, mimicking smooth swimming in water, despite being grounded."
* Unless otherwise mentioned, prompts are rewritten into expressive text descriptions before being fed into MoMask++.
Ablation Analysis
Impact on VQ Reconstruction
We investigate the impact of the number of residual layers and the number of tokens on VQ reconstruction quality. Additionally, we compare our method against the 6-layer RVQ used in MoMask. The number of tokens is calculated based on the encoding of a 320-frame motion sequence. Results show that increasing the number of layers and tokens lets our approach capture high-fidelity motion details, enabling better modeling of holistic motion patterns than the RVQ used in MoMask.
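To make the role of residual layers concrete, here is a minimal sketch of residual quantization in plain Python. It is not the MoMask++ implementation: the 2-D toy codebooks, the `make_codebook` helper, and the input vector are all hypothetical and chosen only for illustration. Each layer quantizes whatever the previous layers left unexplained, so the reconstruction error cannot grow as layers are added (every toy codebook contains the zero vector).

```python
import math

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def residual_quantize(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual

def make_codebook(scale):
    """Hypothetical toy codebook: a 3x3 grid (including zero) with spacing `scale`."""
    steps = (-scale, 0.0, scale)
    return [(x, y) for x in steps for y in steps]

# Four layers with progressively finer codebooks (an assumption of this sketch).
codebooks = [make_codebook(1.0 / 2 ** layer) for layer in range(4)]
vec = (0.83, -0.41)

errors = []
for n_layers in range(1, len(codebooks) + 1):
    indices, residual = residual_quantize(vec, codebooks[:n_layers])
    errors.append(math.sqrt(sum(r * r for r in residual)))

print(errors)  # reconstruction error after 1, 2, 3, and 4 layers
```

In this toy setting each extra layer contributes one more token per encoded vector, which is the trade-off between token count and reconstruction fidelity that the ablation examines.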
Generation
We analyze the effect of residual tokens, multi-scale quantization, and prompt rewriting on the final motion generation quality. As shown below, using only a single VQ token sequence (w/o residual VQ) or multiple full-scale token sequences of the same length (w/o multi-scale VQ) results in limited understanding of nuanced text prompts. Moreover, when casual user prompts are used directly for generation (w/o prompt rewriting), the model exhibits significant semantic degradation.
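The difference between multi-scale and full-scale token sequences can be illustrated with a toy 1-D example. This is a sketch under stated assumptions, not the authors' method: the average-pool `downsample`, repeat-based `upsample`, uniform scalar `quantize`, the layer lengths, and the signal are all hypothetical. Coarse layers summarize the whole sequence with few tokens, while later layers add finer temporal detail to the residual.

```python
def downsample(signal, length):
    """Average-pool signal into `length` equal chunks (len(signal) must divide evenly)."""
    chunk = len(signal) // length
    return [sum(signal[i * chunk:(i + 1) * chunk]) / chunk for i in range(length)]

def upsample(tokens, length):
    """Repeat each value so the result has `length` entries."""
    repeat = length // len(tokens)
    return [t for t in tokens for _ in range(repeat)]

def quantize(values, step):
    """Toy scalar quantizer: snap each value to the nearest multiple of `step`."""
    return [round(v / step) * step for v in values]

def multi_scale_quantize(signal, lengths, step=0.25):
    """Quantize the running residual at each temporal resolution in `lengths`."""
    residual = list(signal)
    layers = []
    for length in lengths:
        coarse = quantize(downsample(residual, length), step)
        layers.append(coarse)
        recon = upsample(coarse, len(signal))
        residual = [r - c for r, c in zip(residual, recon)]
    return layers, residual

signal = [0.1, 0.2, 0.4, 0.7, 1.0, 1.1, 0.9, 0.5]  # hypothetical 8-step latent

# Multi-scale: layers of increasing length. Full-scale: every layer at full length.
ms_layers, ms_residual = multi_scale_quantize(signal, [1, 2, 4, 8])
fs_layers, fs_residual = multi_scale_quantize(signal, [8, 8, 8, 8])

ms_tokens = sum(len(layer) for layer in ms_layers)  # 1 + 2 + 4 + 8 = 15
fs_tokens = sum(len(layer) for layer in fs_layers)  # 4 * 8 = 32
print(ms_tokens, fs_tokens)
```

With these assumed lengths, the multi-scale variant spends 15 tokens where the full-scale variant spends 32, and its first layers each describe the sequence at a coarse, sequence-wide level.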
Comparisons
We show one example (#1) from the SnapMoGen test set, and two examples using in-the-wild user prompts (#2, #3). For the latter two cases, all models take the rewritten prompts as input.
Real-world Application
SnapMoGen has led to a text2motion feature launched in LensStudio (v5.11.0) of Snap VR.
Limitations
"The person drops low, crawling forward on hands and knees with urgency. They weave their torso and duck their head side to side as if narrowly avoiding invisible laser beams. Arms stretch out to maintain balance while legs push powerfully, body tense with alertness. Their movements are fluid but deliberate, moving forward cautiously with sharp, sudden dodges."
"Motion artifacts persist (e.g., sliding, jittering)."
"The person skips forward energetically, bouncing on alternating feet with light, rhythmic hops. Their arms move in circular patterns as if juggling several invisible balls, tossing them from hand to hand. Their torso sways rhythmically, and they occasionally look upward or to the side to track the imaginary objects, ending the sequence with a playful spin."
"Missing semantic cues (e.g., junggling balls)."
"Standing tall, the person reaches both arms toward the sky with a deep inhale. They bend forward slowly at the waist, touching the ground with fingertips. Then they step one leg back into a lunge, lifting the arms overhead in a stretch. They transition into downward dog, hold it briefly, then step forward and return to standing."
"Fail on rare motions."
Related Motion Generation Works 🚀🚀
TM2T: Learning text2motion and motion2text reciprocally through discrete token and language model.
InterMask: Human interaction generation from text prompts through generative masking.
MoMask: Generative masked modeling of 3D human motions.
BibTeX
@misc{snapmogen2025,
title={SnapMoGen: Human Motion Generation from Expressive Texts},
author={Chuan Guo and Inwoo Hwang and Jian Wang and Bing Zhou},
year={2025},
eprint={2507.09122},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.09122},
}

