Learning to Listen:
Modeling Non-Deterministic Dyadic Facial Motion
CVPR 2022
Ng (UC Berkeley) · Joo (Seoul National University) · Hu (Pinscreen) · Li (Pinscreen) · Darrell (UC Berkeley) · Kanazawa (UC Berkeley) · Ginosar (UC Berkeley)
We present a framework for modeling interactional communication in dyadic conversations: given multimodal inputs of a speaker, we autoregressively output multiple possibilities of corresponding listener motion. We combine the motion and speech audio of the speaker using a motion-audio cross attention transformer. Furthermore, we enable non-deterministic prediction by learning a discrete latent representation of realistic listener motion with a novel motion-encoding VQ-VAE. Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions. Moreover, it produces realistic 3D listener facial motion synchronous with the speaker (see video). We demonstrate that our method outperforms baselines qualitatively and quantitatively via a rich suite of experiments. To facilitate this line of research, we introduce a novel and large in-the-wild dataset of dyadic conversations.
The goal of our work is to model the conversational dynamics between a speaker and listener. We introduce a novel motion VQ-VAE that allows us to output nondeterministic listener motion sequences in an autoregressive manner. Given speaker motion and audio as inputs, our approach generates realistic, synchronous, and diverse listener motion sequences that outperform prior SOTA.
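At the heart of this pipeline is a learned discrete latent space of listener motion. As a rough illustration of that idea only (not the paper's released implementation), the sketch below shows a minimal PyTorch-style vector quantizer over continuous motion features; the class name, codebook size, feature dimensions, and loss weighting are all placeholder assumptions.

```python
# Minimal sketch of a motion VQ-VAE quantizer (illustrative only; names,
# dimensions, and loss weights are assumptions, not the paper's exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionQuantizer(nn.Module):
    """Maps continuous motion features to the nearest entry of a learned codebook."""
    def __init__(self, num_codes=256, code_dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z_e):                      # z_e: (batch, time, code_dim)
        flat = z_e.reshape(-1, z_e.shape[-1])    # (batch*time, code_dim)
        # Squared L2 distance from each feature to every codebook entry.
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))
        idx = dists.argmin(dim=1)                # discrete motion "tokens"
        z_q = self.codebook(idx).view_as(z_e)
        # Codebook + commitment losses, straight-through gradient to the encoder.
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx.view(z_e.shape[:-1]), loss
```

Because future listener motion is represented as a sequence of such discrete tokens, "which motion comes next" can be treated as a categorical prediction problem, which is what makes sampling multiple plausible listener responses straightforward.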
(1) To represent the manifold of realistic listener facial motion, we extend VQ-VAE to the domain of motion synthesis. The learned discrete representation of motion enables us to model the next time step of motion as a multinomial distribution.
(2) We use a motion-audio cross-modal transformer that learns to fuse the speaker's audio and facial motion.
(3) We then learn an autoregressive transformer-based predictor that takes the speaker and listener embeddings as input and outputs a distribution over possible synchronous and realistic listener responses, from which we can sample multiple trajectories (a rough sketch of such a sampling loop follows below).
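To make the cross-modal fusion and the non-deterministic sampling concrete, here is a hedged, self-contained sketch of how such a predictor could be wired up and sampled from. Everything here is an assumption for illustration: the module names, feature sizes (e.g. 80-dim audio frames, 64-dim speaker expression coefficients), start token, and temperature are placeholders rather than the paper's released architecture.

```python
# Illustrative cross-modal, autoregressive predictor over listener motion tokens.
# All shapes, module names, and hyperparameters are assumptions for this sketch.
import torch
import torch.nn as nn

class ListenerPredictor(nn.Module):
    def __init__(self, num_codes=256, dim=128, heads=8):
        super().__init__()
        self.token_emb = nn.Embedding(num_codes, dim)
        self.audio_proj = nn.Linear(80, dim)      # e.g. mel-spectrogram frames (assumed)
        self.motion_proj = nn.Linear(64, dim)     # e.g. speaker expression coeffs (assumed)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_codes)

    def fuse_speaker(self, audio, motion):
        # Speaker motion queries attend to speaker audio (one possible fusion scheme).
        a, m = self.audio_proj(audio), self.motion_proj(motion)
        fused, _ = self.cross_attn(query=m, key=a, value=a)
        return fused

    def forward(self, listener_tokens, speaker_ctx):
        x = self.token_emb(listener_tokens)
        x = self.decoder(tgt=x, memory=speaker_ctx)
        return self.head(x)                       # logits over the motion codebook

@torch.no_grad()
def sample_listener(model, audio, motion, steps=32, temperature=1.0):
    """Draw one of many possible listener trajectories for the same speaker input."""
    ctx = model.fuse_speaker(audio, motion)
    tokens = torch.zeros(audio.shape[0], 1, dtype=torch.long)   # start token (assumed)
    for _ in range(steps):
        logits = model(tokens, ctx)[:, -1] / temperature
        next_tok = torch.multinomial(logits.softmax(-1), num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, 1:]   # discrete codes, to be decoded by the VQ-VAE decoder
```

Calling `sample_listener` repeatedly on the same speaker input yields different token sequences, which is what allows a single speaker clip to produce multiple distinct listener responses.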
Highlights
Given a speaker's facial motion and audio, our method generates synchronous, realistic listener motion.
Multiple Samples
Our method generates multiple possible listener trajectories from a single speaker input.
Comparison vs. Baselines
Our method outperforms existing baselines.
Comparison vs. Ablations
Ablations demonstrate the contribution of each component of our method.
Fun Results
Since our method generalizes to unseen speakers, we can generate results on novel listeners. We thank Devi Parikh for allowing us to use her podcast videos.
@article{ng2022learning2listen,
  title   = {Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion},
  author  = {Ng, Evonne and Joo, Hanbyul and Hu, Liwen and Li, Hao and Darrell, Trevor and Kanazawa, Angjoo and Ginosar, Shiry},
  journal = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year    = {2022}
}
The authors would like to thank Justine Cassell, Alyosha Efros, Alison Gopnik, Jitendra Malik, and the Facebook FRL team for many insightful conversations and comments; Dave Epstein and Karttikeya Mangalam for Transformer advice; and Ruilong Li and Ethan Weber for technical support. The work of Ng and Darrell is supported in part by DoD, including DARPA’s XAI, LwLL, Machine Common Sense and/or SemaFor programs, which also support Hu and Li in part, as well as by BAIR’s industrial alliance programs. Ginosar’s work is funded by the NSF under Grant #2030859 to the Computing Research Association for the CIFellows Project. Parent authors would like to thank their children for the daily reminder that they should learn how to listen.