X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio
Chenxu Zhang, Zenan Li, Hongyi Xu, You Xie, Xiaochen Zhao, Tianpei Gu, Guoxian Song, Xin Chen, Chao Liang, Jianwen Jiang, Linjie Luo
SIGGRAPH Asia 2025 (Conference Track), ByteDance Inc.
X-Actor decouples video synthesis from audio-conditioned motion generation, operating in a compact, expressive, and identity-agnostic facial motion latent space. Specifically, we encode talking-video frames into sequences of motion latents using a pretrained motion encoder. These latents are corrupted with asynchronously sampled noise levels and denoised by an autoregressive diffusion model trained with a diffusion-forcing scheme. Within each motion chunk, we apply full self-attention to preserve fine-grained expressiveness, while causal cross-chunk attention ensures long-range temporal coherence and context awareness. Each motion token attends to frame-aligned audio features via windowed cross-attention, enabling accurate lip synchronization and capturing transient emotional shifts. At inference time, we autoregressively and iteratively predict future motion tokens under a monotonically decreasing noise schedule over the historical motion context. Finally, conditioned on a single reference image, we render the predicted motion sequence into high-fidelity, emotionally rich video frames using a pretrained diffusion-based video generator.
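To make the training scheme concrete, here is a minimal PyTorch sketch of one diffusion-forcing step as we read it from the description above. It is an illustration, not the released implementation: the `model` interface, the per-chunk (rather than per-token) noise levels, the linear corruption rule, and the x0-prediction loss are all assumptions for exposition.

```python
import torch
import torch.nn.functional as F


def block_causal_mask(num_chunks: int, chunk_len: int) -> torch.Tensor:
    """Full self-attention within a motion chunk, causal across chunks.

    Returns a boolean (T, T) mask with T = num_chunks * chunk_len, where
    True marks key positions a query token may attend to.
    """
    chunk_id = torch.arange(num_chunks * chunk_len) // chunk_len
    return chunk_id[None, :] <= chunk_id[:, None]


def audio_window_mask(num_tokens: int, window: int = 2) -> torch.Tensor:
    """Windowed cross-attention: motion token i attends only to frame-aligned
    audio features j with |i - j| <= window."""
    idx = torch.arange(num_tokens)
    return (idx[None, :] - idx[:, None]).abs() <= window


def diffusion_forcing_step(model, motion, audio, num_chunks, chunk_len):
    """One training step with asynchronously sampled noise levels.

    motion: (B, T, D) clean motion latents, T = num_chunks * chunk_len
    audio:  (B, T, D_a) frame-aligned audio features
    """
    # One independent noise level per chunk (a per-chunk granularity is our
    # assumption), so history and future sit at different corruption levels.
    B = motion.shape[0]
    t = torch.rand(B, num_chunks, device=motion.device)
    t_tok = t.repeat_interleave(chunk_len, dim=1).unsqueeze(-1)  # (B, T, 1)
    noise = torch.randn_like(motion)
    noisy = (1.0 - t_tok) * motion + t_tok * noise  # linear corruption rule
    pred = model(
        noisy, t_tok, audio,
        self_attn_mask=block_causal_mask(num_chunks, chunk_len),
        cross_attn_mask=audio_window_mask(motion.shape[1]),
    )
    # x0-prediction target; noise- or velocity-prediction would be equally
    # plausible readings of the abstract.
    return F.mse_loss(pred, motion)
```

Because every chunk sees its own corruption level during training, the same network can later denoise a fully noisy chunk while conditioning on a lightly noised history.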
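The inference loop can be sketched the same way. The snippet below reuses the hypothetical `model`, `block_causal_mask`, and `audio_window_mask` from the training sketch; the step count, the `hist_base` level, the exact shape of the decreasing history schedule, and the DDIM-style update are illustrative guesses rather than details taken from the paper.

```python
import torch


@torch.no_grad()
def autoregressive_rollout(model, audio, num_chunks, chunk_len, dim,
                           denoise_steps=8, hist_base=0.2):
    """Sketch of chunk-by-chunk inference.

    Each new chunk starts as pure noise and is iteratively denoised while
    attending to the history, which is re-corrupted with per-chunk noise
    levels that decrease monotonically the further back a chunk lies.
    audio: (1, num_chunks * chunk_len, D_a) frame-aligned audio features.
    """
    device = audio.device
    history = torch.zeros(1, 0, dim, device=device)  # denoised motion so far
    for c in range(num_chunks):
        if c > 0:
            # Oldest chunk gets the lowest level, newest the highest (<= hist_base).
            hist_t = hist_base * torch.arange(1, c + 1, device=device) / c
            hist_t = hist_t.repeat_interleave(chunk_len).view(1, -1, 1)
            hist = (1.0 - hist_t) * history + hist_t * torch.randn_like(history)
        else:
            hist_t = torch.zeros(1, 0, 1, device=device)
            hist = history
        x = torch.randn(1, chunk_len, dim, device=device)  # fresh noisy chunk
        n_tok = (c + 1) * chunk_len
        m_self = block_causal_mask(c + 1, chunk_len)
        m_cross = audio_window_mask(n_tok)
        for s in range(denoise_steps, 0, -1):
            t_cur, t_next = s / denoise_steps, (s - 1) / denoise_steps
            t_tok = torch.full((1, chunk_len, 1), t_cur, device=device)
            seq = torch.cat([hist, x], dim=1)
            t_all = torch.cat([hist_t, t_tok], dim=1)
            x0_hat = model(seq, t_all, audio[:, :n_tok],
                           self_attn_mask=m_self,
                           cross_attn_mask=m_cross)[:, -chunk_len:]
            eps_hat = (x - (1.0 - t_cur) * x0_hat) / t_cur  # implied noise
            x = (1.0 - t_next) * x0_hat + t_next * eps_hat  # DDIM-style step
        history = torch.cat([history, x], dim=1)
    return history  # (1, num_chunks * chunk_len, dim) predicted motion latents
```

Keeping a little residual noise on the generated history is one way to blunt error accumulation over long rollouts, since the model was trained to tolerate noisy context.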
Pipeline
Motion Diversity: Single Reference Image, Multiple Audio Tracks
More Results
Comparisons with Prior Methods
Ablation Study
Ethics Concerns
The images and audio clips used in the demos come from public sources or were generated by models, and are used solely to demonstrate the capabilities of this research. If you have any concerns, please contact us (chenxuzhang@bytedance.com) and we will remove the content promptly.