DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions
Sammy Christen, Shreyas Hampali, Fadime Sener, Edoardo Remelli, Tomas Hodan, Eric Sauser, Shugao Ma, Bugra Tekin
Abstract
Generating natural hand-object interactions in 3D is challenging because the resulting hand and object motions are expected to be physically plausible and semantically meaningful. Furthermore, generalization to unseen objects is hindered by the limited scale of available hand-object interaction datasets. In this paper, we propose a novel method, dubbed DiffH2O, which can synthesize realistic one- or two-handed object interactions from a provided text prompt and the geometry of the object. The method introduces three techniques that enable effective learning from limited data. First, we decompose the task into a grasping stage and a text-based manipulation stage and use a separate diffusion model for each. In the grasping stage, the model generates only hand motions, whereas in the manipulation stage both hand and object poses are synthesized. Second, we propose a compact representation that tightly couples hand and object poses and helps in generating realistic hand-object interactions. Third, we propose two guidance schemes that allow more control over the generated motions: grasp guidance and detailed textual guidance. Grasp guidance takes a single target grasping pose and guides the diffusion model to reach this grasp at the end of the grasping stage, which provides control over the grasping pose. Given a grasping motion from this stage, multiple different actions can be prompted in the manipulation stage. For the textual guidance, we contribute comprehensive text descriptions to the GRAB dataset and show that they enable our method to have more fine-grained control over hand-object interactions. Our quantitative and qualitative evaluation demonstrates that the proposed method outperforms baseline methods and leads to natural hand-object motions. Moreover, we demonstrate the practicality of our framework by using a hand pose estimate from an off-the-shelf pose estimator for guidance and then sampling multiple different actions in the manipulation stage.
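The grasp-guidance idea in the abstract can be illustrated with a small toy sketch. This is not the DiffH2O implementation: the denoiser below is a hypothetical stand-in (simple temporal smoothing instead of a learned network), the noise schedule is made up, and all names (`toy_denoise`, `guided_grasp_sampling`, `guidance_scale`) are assumptions introduced here. The sketch only shows the general pattern of nudging the final frame of a sampled motion toward a target grasp pose at every reverse-diffusion step.

```python
import numpy as np


def toy_denoise(x, t):
    """Hypothetical stand-in for a learned denoiser (NOT the paper's model).

    Smooths interior frames over time; the step index t is ignored here.
    """
    smoothed = x.copy()
    smoothed[1:-1] = (x[:-2] + x[1:-1] + x[2:]) / 3.0
    return smoothed


def guided_grasp_sampling(target_grasp, num_frames=16, num_steps=50,
                          guidance_scale=0.5, seed=0):
    """Sample a toy motion whose last frame is guided toward target_grasp."""
    rng = np.random.default_rng(seed)
    # Start from Gaussian noise: one pose vector per frame.
    x = rng.standard_normal((num_frames, target_grasp.shape[0]))
    for t in reversed(range(num_steps)):
        x = toy_denoise(x, t)
        # Guidance: gradient of 0.5 * ||x[-1] - target||^2 w.r.t. the last frame.
        grad = x[-1] - target_grasp
        x[-1] = x[-1] - guidance_scale * grad
        if t > 0:
            # Assumed schedule: re-inject a small amount of noise that shrinks with t.
            x = x + (0.1 * t / num_steps) * rng.standard_normal(x.shape)
    return x


if __name__ == "__main__":
    target = np.linspace(-1.0, 1.0, 6)  # hypothetical 6-D grasp pose
    motion = guided_grasp_sampling(target)
    print("max error at final frame:", np.abs(motion[-1] - target).max())
```

In a two-stage setup like the one described above, the final frame of such a grasping motion would then serve as the starting point for the manipulation stage, where different text prompts can be sampled.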
Examples of Unseen Objects
"The person picks up the large torus with their right hand, inspects it at chest level with both hands, and finally places it on the table using their right hand."
"The person picks up the elephant using their left hand and inspects it by rotating it in every direction."
"The person uses their right hand to pick up the train from the high table, passes it to their left hand, then passes it to another person on the left side using their left hand, and finally puts it back on the table using their right hand."
"The person picks up the mug with their right hand, lifts it to their mouth, drinks from it while holding it with their right hand, and then places it on the table with their right hand."
"The person picks up the medium cylinder with their right hand, passes it to their left hand, then passes it to someone on their left side with their left hand, and finally places it on the table with their right hand."
"The person picks up the apple from a high table using their right hand, passes it to someone on their right at chest level using their right hand, and finally places it on the table using the same hand."
Examples of Seen Objects
"The person lifts the large cube."
"The person looks through the binoculars."
"The person inspects the stanford bunny."
"The person flies the toy airplane."
"The person plays with the game controller."
"The person lifts the large sphere."
"The person uses the stamp."
"The person inspects the small cylinder."
Video
Text Descriptions for GRAB
We provide carefully annotated text descriptions for the GRAB dataset (Taheri et al., 2020).
"The person picks up the apple from the table, eats the apple by taking several bites, and then places it back on the table using their right hand."
"The person uses their right hand to pick up the banana from the table, then peels it with their left hand, takes some bites, and finally places it back on the table with their right hand."
"The person picks up the binoculars with the right hand and then switches to holding them with both hands, looks around through the binoculars while bending their upper body and then places them back down on the table using the right hand."
"The person picks up the cup from the table using their right hand, pours the liquid inside several times onto the table, and finally places the cup back onto the table using their right hand."
"The person grabs the mouse from the table with their right hand and then uses it by moving it around the table and clicking several times."
"The person picks up the medium pyramid using both hands, with their right hand at the tip and left hand at the bottom, investigates it with their right hand at neck level, and then places it on the table with their right hand."
Diversity Evaluations
Different Objects (Seen), Same Text Prompt
"Pick up the <object> and offhand it to the other hand."
Different Objects (Unseen), Same Text Prompt
"Pick up the <object> and offhand it to the other hand."
Same Object, Same Text Prompt
"Pick up the wineglass and drink from it."
Same Object, Different Text Prompts
"Pick up the wineglass and drink from it."
"Take the wineglass and make a toast."
"Get the wineglass and pass it to someone else."
Comparison to Baseline
IMoS (Ghosh et al., 2023)
"Inspect the apple."
Ours
"Inspect the apple."
IMoS (Ghosh et al., 2023)
"Pass the piggybank."
Ours
"Pass the piggybank."
Generation with Image-Based Pose Estimate
Source Image
Generated Sequences
"Pass the bleach."
"Drink from the bleach."
"Pick up and put down the bleach."
Source Image
Generated Sequences
"Pass the canned meat."
"Pick up and put down the canned meat."
BibTeX
@inproceedings{christen2024diffh2o,
title={DiffH2O: Diffusion-based synthesis of hand-object interactions from textual descriptions},
author={Christen, Sammy and Hampali, Shreyas and Sener, Fadime and Remelli, Edoardo and Hodan, Tomas and Sauser, Eric and Ma, Shugao and Tekin, Bugra},
booktitle={SIGGRAPH Asia 2024 Conference Papers},
year={2024}
}