Towards Immersive Human-X Interaction: A Real-Time
Framework for Physically Plausible Motion Synthesis
Abstract
Real-time synthesis of physically plausible human interactions remains a critical challenge for immersive VR/AR systems and humanoid robotics. While existing methods demonstrate progress in kinematic motion generation, they often fail to address the fundamental tension between real-time responsiveness, physical feasibility, and safety requirements in dynamic human-machine interactions. We introduce Human-X, a novel framework designed to enable immersive and physically plausible human interactions across diverse entities, including human-avatar, human-humanoid, and human-robot systems. Unlike existing approaches that focus on post-hoc alignment or simplified physics, our method jointly predicts actions and reactions in real time using an auto-regressive reaction diffusion planner, ensuring seamless synchronization and context-aware responses. To enhance physical realism and safety, we integrate an actor-aware motion tracking policy trained with reinforcement learning, which dynamically adapts to interaction partners' movements while avoiding artifacts such as foot sliding and penetration. Extensive experiments on the Inter-X and InterHuman datasets demonstrate significant improvements in motion quality, interaction continuity, and physical plausibility over state-of-the-art methods. Our framework is validated in real-world applications, including a virtual reality interface for human-robot interaction, showcasing its potential for advancing human-robot collaboration.
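To make the auto-regressive planning described above concrete, the following is a minimal sketch of how such a plan-then-track loop could be organized. All names here (ReactionDiffusionPlanner, interaction_loop, the window sizes, and the zero-filled placeholder outputs) are illustrative assumptions, not the authors' implementation: the planner stub stands in for a conditional diffusion model, and the tracking step stands in for the physics-based controller.

```python
# Illustrative sketch only: class names, dimensions, and the placeholder planner
# are assumptions; they do not reproduce the Human-X model or its API.
import numpy as np

class ReactionDiffusionPlanner:
    """Stand-in for an auto-regressive reaction diffusion planner: given recent
    actor motion and the reactor's own history, a real model would denoise the
    next short window of reactor poses. Here it returns zeros as a placeholder."""
    def __init__(self, horizon=8, pose_dim=69):
        self.horizon, self.pose_dim = horizon, pose_dim

    def plan(self, actor_history, reactor_history, text_prompt=None):
        return np.zeros((self.horizon, self.pose_dim))

def interaction_loop(planner, track_step, capture_actor_pose, seconds=1, fps=30):
    """Auto-regressive loop: observe the actor, plan a short reaction window,
    hand each target frame to a physics-based tracking step, then re-plan."""
    actor_hist, reactor_hist = [], []
    for _ in range(seconds * fps // planner.horizon):
        actor_hist.append(capture_actor_pose())                 # e.g. RGB-D pose estimate
        window = planner.plan(actor_hist[-fps:], reactor_hist[-fps:])
        for target in window:
            # An actor-aware tracking policy would turn each kinematic target
            # into simulated joint actions; here an identity step is used.
            reactor_hist.append(track_step(target, actor_hist[-1]))
    return reactor_hist

# Toy usage with dummy capture and tracking callables:
poses = interaction_loop(
    ReactionDiffusionPlanner(),
    track_step=lambda target, actor: target,
    capture_actor_pose=lambda: np.zeros(69),
)
print(len(poses))
```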
Video
Method
Figure 2. Overview of our immersive real-time interaction synthesis pipeline: (a) Actor Motion Capture: A human actor's movements are recorded at 30 fps by an RGB-D camera and translated into 3D poses, which are then retargeted to a humanoid character. (b) Realistic Reactor Motion Generation: An auto-regressive diffusion model, guided by optional text prompts (e.g., "Dancing is what to do"), generates plausible reaction motions. These motions are tracked by an actor-aware controller, which uses proprioceptive signals to ensure realistic, synchronized interactions. (c) Real-time VR Interface: The generated and tracked motions are rendered in Isaac Gym, providing both a third-person view and a binocular VR view.
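The caption above mentions that the actor-aware controller consumes both proprioception and the partner's state. Below is a small sketch of how such an observation vector might be assembled; the field names, reference-frame choice, and dimensions are assumptions for illustration rather than the paper's actual specification.

```python
# Hypothetical observation assembly for an actor-aware tracking policy.
# Names and dimensions are assumed; only the idea (proprioception + planner
# target + partner state) follows the pipeline description.
import numpy as np

def build_observation(reactor_qpos, reactor_qvel, target_pose, actor_root, actor_joints):
    """Concatenate the reactor's proprioception, the kinematic target from the
    reaction planner, and the interaction partner's (actor's) state, with the
    partner's root expressed relative to the reactor so the policy is less
    sensitive to absolute world position."""
    rel_actor_root = actor_root - reactor_qpos[:3]   # partner root offset
    return np.concatenate([
        reactor_qpos,         # joint positions (proprioception)
        reactor_qvel,         # joint velocities (proprioception)
        target_pose,          # per-frame target from the reaction planner
        rel_actor_root,       # where the partner is, relative to the reactor
        actor_joints.ravel(), # partner joint configuration (actor-awareness)
    ])

# Example with made-up dimensions (e.g. a 23-DoF humanoid and 24 actor joints):
obs = build_observation(
    reactor_qpos=np.zeros(23 + 7),   # root pose + joint angles
    reactor_qvel=np.zeros(23 + 6),
    target_pose=np.zeros(23),
    actor_root=np.zeros(3),
    actor_joints=np.zeros((24, 3)),
)
print(obs.shape)
```

Feeding the partner's state into the observation, rather than only the reactor's own proprioception, is what would let a reinforcement-learned tracking policy adapt its contacts and footwork to the actor's movements in real time.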
Visualization Results
Figure 3. Visualization of human reaction synthesis results. Blue denotes actors and orange denotes reactors. Compared to CAMDM (top row), Human-X (bottom row) achieves more complete hand contact in tasks such as face-hitting and handshaking. Additionally, its foot movement appears more natural, as highlighted in the red and green circles.