CAMS: CAnonicalized Manipulation Spaces for Category-Level Functional Hand-Object Manipulation Synthesis
Abstract
In this work, we focus on a novel task of category-level functional hand-object manipulation synthesis covering both rigid and articulated object categories. Given an object geometry, an initial human hand pose, and a sparse control sequence of object poses, our goal is to generate a physically plausible hand-object manipulation sequence that performs the task as a human would. To address this challenge, we first design CAnonicalized Manipulation Spaces (CAMS), a two-level space hierarchy that canonicalizes hand poses in an object-centric and contact-centric view. Benefiting from the representation power of CAMS, we then present a two-stage framework for synthesizing human-like manipulation animations. Our framework achieves state-of-the-art performance for both rigid and articulated categories with impressive visual quality.
Video
Method
Our framework mainly consists of a CVAE-based planner module and an optimization-based synthesizer module. Given the generation condition as input, the planner first generates a per-stage CAMS representation containing contact reference frames and sequences of finger embeddings. The synthesizer then optimizes the whole manipulation animation based on the CAMS embedding.
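As a minimal sketch of this pipeline (hypothetical function and module names, not the released code), the planner and synthesizer can be wired together as follows:

# A minimal sketch of the two-stage pipeline (hypothetical names, not the
# released implementation): a learned planner proposes a CAMS plan, and an
# optimization-based synthesizer converts it into a hand motion sequence.

def synthesize_manipulation(object_geometry, init_hand_pose, object_pose_sequence,
                            planner, synthesizer):
    # Stage 1: plan per-stage contact reference frames and finger embeddings (CAMS).
    cams_plan = planner.sample(object_geometry, init_hand_pose, object_pose_sequence)
    # Stage 2: optimize MANO hand parameters to realize the planned CAMS embedding.
    return synthesizer.optimize(cams_plan, object_geometry, object_pose_sequence)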
CAMS Representation of Hand Motion
CAnonicalized Manipulation Spaces use a two-level canonicalization to represent manipulation. At the root level, the canonicalized contact targets (top right) describe the discrete contact information. At the leaf level, the canonicalized finger embedding (bottom right) transforms finger motion from global space into local reference frames defined on the contact targets.
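To make the leaf-level canonicalization concrete, the sketch below (an illustrative helper, not the paper's exact formulation) re-expresses global finger joint positions in a local reference frame attached to a contact target, so the resulting embedding is invariant to the object's pose:

import numpy as np

def canonicalize_finger_joints(joints_global, frame_rotation, frame_origin):
    """Re-express finger joints in a contact reference frame (illustrative).

    joints_global : (N, 3) finger joint positions in world coordinates
    frame_rotation: (3, 3) rotation of the contact frame (columns = frame axes in world)
    frame_origin  : (3,)   origin of the contact frame in world coordinates
    returns       : (N, 3) joint positions expressed in the local contact frame
    """
    # local = R^T (x - t), written here with the row-vector convention
    return (np.asarray(joints_global) - frame_origin) @ frame_rotation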
CAMS-CVAE
The CVAE-based motion planner module takes the task configuration and object shape as inputs, and generates a CAMS motion sample corresponding to the input.
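A minimal CVAE sketch for such a planner is shown below, assuming the CAMS representation is flattened into one vector and the condition (task configuration plus an object shape feature) into another; layer sizes and dimensions are illustrative, not the paper's architecture.

import torch
import torch.nn as nn

class CAMSCVAE(nn.Module):
    def __init__(self, cams_dim, cond_dim, latent_dim=64, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(cams_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),        # outputs mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, cams_dim),              # reconstructed CAMS vector
        )
        self.latent_dim = latent_dim

    def forward(self, cams, cond):
        mu, logvar = self.encoder(torch.cat([cams, cond], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar

    @torch.no_grad()
    def sample(self, cond):
        # At test time, draw a latent from the prior and decode a CAMS plan.
        z = torch.randn(cond.shape[0], self.latent_dim, device=cond.device)
        return self.decoder(torch.cat([z, cond], dim=-1))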
Optimization-based Synthesizer
The synthesizer adopts a two-stage optimization method that first optimizes the MANO pose parameters to best fit the CAMS finger embedding and then optimizes the contact effect to improve physical plausibility.
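The sketch below illustrates this style of two-stage optimization, assuming a MANO forward function, per-frame target joints decoded from the planned CAMS embedding, and an object signed-distance function are available; the loss terms and weights are illustrative, not the paper's exact energy.

import torch

def optimize_hand(mano_forward, params_init, target_joints, object_sdf,
                  steps_fit=200, steps_refine=100, lr=1e-2):
    """
    mano_forward : callable, MANO params -> (T, J, 3) joint positions
    params_init  : (T, P) initial MANO pose parameters, one row per frame
    target_joints: (T, J, 3) joints decoded from the planned CAMS embedding
    object_sdf   : callable, (..., 3) points -> signed distance to the object surface
    """
    params = params_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([params], lr=lr)

    # Stage 1: fit the hand pose to the planned CAMS finger embedding.
    for _ in range(steps_fit):
        joints = mano_forward(params)
        loss = ((joints - target_joints) ** 2).mean()
        optimizer.zero_grad(); loss.backward(); optimizer.step()

    # Stage 2: refine contact effects -- penalize penetration and enforce
    # temporal smoothness while staying close to the planned embedding.
    for _ in range(steps_refine):
        joints = mano_forward(params)
        fit = ((joints - target_joints) ** 2).mean()
        penetration = torch.relu(-object_sdf(joints)).mean()      # inside-object depth
        smooth = ((params[1:] - params[:-1]) ** 2).mean()         # frame-to-frame change
        loss = fit + 10.0 * penetration + 1.0 * smooth
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    return params.detach()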
Experiments
Results
[Videos: synthesized manipulation sequences for Kettle, Laptop, Pliers, and Scissors; each example shows the input control sequence and two additional views.]
Mode Diversity
[Videos: diverse manipulation modes generated for Laptop, Pliers, and Scissors, shown from multiple views.]
Comparison
[Videos: qualitative comparison of our method against GraspTTA and ManipNet, shown from multiple views.]
BibTeX
@InProceedings{Zheng_2023_CVPR,
author = {Zheng, Juntian and Zheng, Qingyuan and Fang, Lixing and Liu, Yun and Yi, Li},
title = {CAMS: CAnonicalized Manipulation Spaces for Category-Level Functional Hand-Object Manipulation Synthesis},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {585-594}
}
Contact
If you have any questions, please feel free to contact us:
Juntian Zheng: jt-zheng20@mails.tsinghua.edu.cn
Lixing Fang: flx20@mails.tsinghua.edu.cn