MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete Representations
Abstract
In this work, we present MoConVQ, a novel unified framework for physics-based motion control leveraging scalable discrete representations. Building upon vector quantized variational autoencoders (VQ-VAE) and model-based reinforcement learning, our approach effectively learns motion embeddings from a large, unstructured dataset spanning tens of hours of motion examples. The resultant motion representation not only captures diverse motion skills but also offers a robust and intuitive interface for various applications. We demonstrate the versatility of MoConVQ through several applications: universal tracking control from various motion sources, interactive character control with latent motion representations using supervised learning, physics-based motion generation from natural language descriptions using the GPT framework, and, most interestingly, seamless integration with large language models (LLMs) with in-context learning to tackle complex and abstract tasks.
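As a rough illustration of what a discrete motion representation involves, the snippet below sketches a standard VQ-VAE quantization step in PyTorch. The codebook size, feature dimension, and loss weights are illustrative placeholders under common VQ-VAE conventions, not the actual MoConVQ implementation.

```python
# Minimal sketch of vector quantization for a discrete motion representation.
# Sizes, names, and loss weights are illustrative assumptions, not MoConVQ's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)      # discrete motion codes
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta                                        # commitment-loss weight

    def forward(self, z_e):
        # z_e: (batch, time, code_dim) continuous features from a motion encoder
        w = self.codebook.weight
        # Squared Euclidean distance from every feature to every codebook entry
        dists = (z_e.pow(2).sum(-1, keepdim=True)
                 - 2 * z_e @ w.t()
                 + w.pow(2).sum(-1))
        indices = dists.argmin(dim=-1)                          # nearest code per time step
        z_q = self.codebook(indices)                            # quantized latents
        # Standard VQ-VAE losses: move codes toward encoder outputs and vice versa
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()                        # straight-through estimator
        return z_q, indices, vq_loss
```

The resulting index sequences are what make the representation "discrete": downstream modules can treat a motion as a sequence of tokens rather than a continuous trajectory.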
Pipeline
Tracking Motions from Different Sources
Our motion representation captures diverse motion skills learned from a large, unstructured dataset. We demonstrate its capacity by tracking motions from different sources: clips from unseen datasets, results of video-based pose estimation, and the output of a kinematic motion generator.
Noisy dance from HDM05.
Output of HybrIK.
Tracking the output of a motion latent diffusion model.
Motion Generation with MoConGPT
Thanks to the discrete nature of the learned motion representation, our framework can be integrated with a Generative Pretrained Transformer (GPT) to generate diverse motions. The generation can also be controlled by natural language.
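To make the idea concrete, the sketch below shows how a small GPT-style model could autoregressively generate sequences of discrete motion-token indices. The architecture, layer sizes, and sampling loop are hypothetical assumptions for illustration, not the MoConGPT implementation.

```python
# Illustrative sketch: autoregressive generation of discrete motion tokens
# with a small GPT-style model. All sizes and interfaces are placeholders.
import torch
import torch.nn as nn

class TinyMotionGPT(nn.Module):
    def __init__(self, num_codes=512, dim=256, n_layers=4, n_heads=4, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(num_codes, dim)         # motion-token embeddings
        self.pos_emb = nn.Embedding(max_len, dim)           # learned positional embeddings
        layer = nn.TransformerEncoderLayer(dim, n_heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, num_codes)               # next-token logits

    def forward(self, tokens):
        # tokens: (batch, seq) indices into the motion codebook
        t = tokens.size(1)
        pos = torch.arange(t, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        causal = torch.triu(torch.full((t, t), float("-inf"), device=tokens.device), diagonal=1)
        return self.head(self.blocks(x, mask=causal))       # causal attention over the sequence

@torch.no_grad()
def sample_motion_tokens(model, prompt_tokens, steps=64, temperature=1.0):
    # prompt_tokens: (1, t0) seed tokens, e.g. encoded from a short motion clip
    tokens = prompt_tokens
    for _ in range(steps):
        logits = model(tokens)[:, -1] / temperature
        next_token = torch.multinomial(logits.softmax(dim=-1), num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens  # decode with the VQ decoder and execute via the tracking controller
```

Text control would then amount to conditioning such a model on language features, while the sampled token sequence is decoded and tracked by the physics-based controller.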
Text2Motion with MoConGPT
a person get down and crawls.
a person raising left hand and putting down right hand for seconds, then he jumps up and down for seconds.
a person slightly crouches down and walks forward, then he stand still.
a man is kicking with right leg.
a man walks forward with his right hands up.
Integration with LLM
Our framework can also seamlessly integrate with large language models (LLMs) through in-context learning. We first demonstrate its capacity for zero-shot text-to-motion generation, then showcase its effectiveness on complex and abstract tasks.
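As a rough illustration of in-context learning with a discrete motion interface, the sketch below prompts a generic LLM with a few example token sequences and parses its reply back into codebook indices. The prompt wording, the example token lists, and the `llm_complete` helper are all hypothetical placeholders rather than the prompts or API used in our experiments.

```python
# Hypothetical sketch: steering the motion generator with an LLM via
# in-context learning. The prompt, examples, and llm_complete are placeholders.
import re

PROMPT_TEMPLATE = """You control a simulated character through motion tokens.
Each motion is a list of integer token indices in [0, 511].
Examples (illustrative only):
  "a person walks forward" -> [12, 12, 87, 87, 87, 301]
  "a person jumps" -> [45, 199, 199, 7]
Now produce tokens for: "{instruction}"
Answer with a single bracketed list of integers."""

def text_to_motion_tokens(instruction: str, llm_complete) -> list[int]:
    # llm_complete: any callable that sends a prompt to an LLM and returns its reply text
    reply = llm_complete(PROMPT_TEMPLATE.format(instruction=instruction))
    match = re.search(r"\[([\d,\s]+)\]", reply)
    if match is None:
        raise ValueError(f"Could not parse motion tokens from: {reply!r}")
    return [int(tok) for tok in match.group(1).split(",")]

# The returned indices would then be decoded by the VQ decoder and executed
# by the physics-based tracking controller.
```

Because the interface is a plain token sequence, the same pattern extends to more abstract tasks: the LLM plans in text, emits tokens, and the simulated character carries them out.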
In-context learning with Claude-2
a person picking up a item and about to place it down.
a person walks forward and sits down.
a person walks forward for a long time and kicks, then he begins to dance.
Abstract Task: Walk in a Square Trajectory
Abstract Task: Imagined Scenario
Other Tasks
Interactive control
Walking and running under user control.
Responding to external perturbations.
Latent Motion Matching
Video of latent motion matching 1.
Video of latent motion matching 2.