ISA4D: Interspatial Attention for
Efficient 4D Human Video Generation
Ruizhi Shao, Yinghao Xu, Yujun Shen, Ceyuan Yang, Yang Zheng, Changan Chen, Yebin Liu, Gordon Wetzstein
* Equal Contribution
SIGGRAPH 2025
DEMO Video
Pipeline
Overview of our diffusion transformer architecture for 4D human generation. Taking a reference image, SMPL conditions, camera poses, and background videos as input, our framework first tokenizes the 3D SMPL conditions. In parallel, 2D video tokens are optionally composited with background elements and processed through a cascade of disentangled spatial and temporal transformer blocks, enabling efficient modeling of spatio-temporal relationships. These tokens then interact with the pose tokens via our Interspatial Transformer Block, providing effective 3D-aware conditioning. The resulting features are further enhanced with Plücker camera embeddings for precise view control, and interact with reference image features through cross attention to ensure consistent identity preservation. The entire framework is optimized with a flow-based diffusion formulation, enabling high-quality 4D human generation with controllable pose, viewpoint, and identity.
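The core idea of interspatial attention, as described above, is that 2D video tokens query tokenized 3D SMPL conditions. The sketch below illustrates this as a single-head cross-attention with a residual update; it is a minimal NumPy illustration, not the paper's actual block, and all names (`interspatial_attention`, the projection matrices) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interspatial_attention(video_tokens, pose_tokens, Wq, Wk, Wv):
    """Illustrative single-head cross-attention: 2D video tokens (queries)
    attend to 3D SMPL pose tokens (keys/values), so every video token can
    aggregate 3D-aware conditioning signals.

    video_tokens: (N_v, d) flattened spatio-temporal video tokens
    pose_tokens:  (N_p, d) tokenized SMPL condition
    Wq, Wk, Wv:   (d, d) hypothetical projection matrices
    """
    q = video_tokens @ Wq                     # (N_v, d)
    k = pose_tokens @ Wk                      # (N_p, d)
    v = pose_tokens @ Wv                      # (N_p, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N_v, N_p) scaled dot products
    attn = softmax(scores, axis=-1)           # each video token's weights over pose tokens
    return video_tokens + attn @ v            # residual update of the video tokens

rng = np.random.default_rng(0)
d = 16
video = rng.standard_normal((8, d))           # 8 video tokens
pose = rng.standard_normal((4, d))            # 4 pose tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = interspatial_attention(video, pose, Wq, Wk, Wv)
print(out.shape)  # (8, 16)
```

Because queries come only from video tokens and keys/values only from pose tokens, the attention cost scales with N_v × N_p rather than (N_v + N_p)², which is the intuition behind conditioning efficiently on a small set of 3D tokens.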
Method
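The Plücker camera embeddings mentioned in the pipeline overview assign each pixel a 6D ray descriptor: the normalized ray direction d and its moment o × d about the origin, where o is the camera center. A minimal sketch, assuming a pinhole camera and a world-from-camera convention (the function name and conventions here are illustrative, not the paper's implementation):

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """Per-pixel 6D Plücker ray embedding (d, o x d) for a pinhole camera.

    K: (3, 3) intrinsics; R: (3, 3) world-from-camera rotation;
    t: (3,) camera center in world coordinates (assumed convention).
    Returns an (H, W, 6) array: normalized ray direction plus ray moment.
    """
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Homogeneous pixel coordinates at pixel centers.
    pix = np.stack([xs + 0.5, ys + 0.5, np.ones_like(xs, dtype=float)], axis=-1)
    dirs_cam = pix @ np.linalg.inv(K).T            # back-project pixels to camera rays
    dirs = dirs_cam @ R.T                          # rotate rays into the world frame
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    moment = np.cross(np.broadcast_to(t, dirs.shape), dirs)  # o x d
    return np.concatenate([dirs, moment], axis=-1)           # (H, W, 6)

# Toy camera: 64x64 image, identity pose at the world origin.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
emb = plucker_embedding(K, np.eye(3), np.zeros(3), 64, 64)
print(emb.shape)  # (64, 64, 6)
```

Unlike raw extrinsic matrices, this per-pixel ray parameterization changes smoothly with viewpoint, which is why Plücker embeddings are a common choice for injecting camera control into video diffusion models.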
Results
Multi-Human Generation
Camera Control Generation
Background Composition Generation
Single Human Generation
Face Generation
Upper-body Generation
Ethics
Citation
Ruizhi Shao, Yinghao Xu, Yujun Shen, Ceyuan Yang, Yang Zheng, Changan Chen, Yebin Liu, Gordon Wetzstein. "ISA4D: Interspatial Attention for Efficient 4D Human Video Generation".
ACM Trans. Graph. (Proc. SIGGRAPH) 2025.
@article{shao2024isa4d,
title={ISA4D: Interspatial Attention for Efficient 4D Human Video Generation},
author={Shao, Ruizhi and Xu, Yinghao and Shen, Yujun and Yang, Ceyuan and Zheng, Yang and Chen, Changan and Liu, Yebin and Wetzstein, Gordon},
journal={ACM Transactions on Graphics (TOG)},
year={2025},
publisher={ACM New York, NY, USA}
}