ISA4D: Interspatial Attention for
Efficient 4D Human Video Generation
Ruizhi Shao, Yinghao Xu, Yujun Shen, Ceyuan Yang, Yang Zheng, Changan Chen, Yebin Liu, Gordon Wetzstein
* Equal Contribution
SIGGRAPH 2025
DEMO Video
Pipeline
Overview of our diffusion transformer architecture for 4D human generation. Taking a reference image, SMPL conditions, camera poses, and background videos as input, our framework first tokenizes the 3D SMPL conditions. In parallel, 2D video tokens are optionally composited with background elements and processed through a cascade of disentangled spatial and temporal transformer blocks, enabling efficient modeling of spatio-temporal relationships. These tokens then interact with the pose tokens via our Interspatial Transformer Block, providing effective 3D-aware conditioning. The resulting features are further enhanced with Plücker camera embeddings for precise view control, and interact with reference image features through cross attention to ensure consistent identity preservation. The entire framework is optimized with a flow-based diffusion formulation, enabling high-quality 4D human generation with controllable pose, viewpoint, and identity.
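The core idea of interspatial attention, as described above, is that 2D video tokens query tokenized 3D SMPL conditions. The sketch below illustrates this as a single-head cross-attention with a residual update; it is a minimal NumPy illustration, not the paper's actual block, and all names (`interspatial_attention`, the projection matrices) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interspatial_attention(video_tokens, pose_tokens, Wq, Wk, Wv):
    """Illustrative single-head cross-attention: 2D video tokens (queries)
    attend to 3D SMPL pose tokens (keys/values), so every video token can
    aggregate 3D-aware conditioning signals.

    video_tokens: (N_v, d) flattened spatio-temporal video tokens
    pose_tokens:  (N_p, d) tokenized SMPL condition
    Wq, Wk, Wv:   (d, d) hypothetical projection matrices
    """
    q = video_tokens @ Wq                     # (N_v, d)
    k = pose_tokens @ Wk                      # (N_p, d)
    v = pose_tokens @ Wv                      # (N_p, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N_v, N_p) scaled dot products
    attn = softmax(scores, axis=-1)           # each video token's weights over pose tokens
    return video_tokens + attn @ v            # residual update of the video tokens

rng = np.random.default_rng(0)
d = 16
video = rng.standard_normal((8, d))           # 8 video tokens
pose = rng.standard_normal((4, d))            # 4 pose tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = interspatial_attention(video, pose, Wq, Wk, Wv)
print(out.shape)  # (8, 16)
```

Because queries come only from video tokens and keys/values only from pose tokens, the attention cost scales with N_v × N_p rather than (N_v + N_p)², which is the intuition behind conditioning efficiently on a small set of 3D tokens.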
Method
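The Plücker camera embeddings mentioned in the pipeline overview assign each pixel a 6D ray descriptor: the normalized ray direction d and its moment o × d about the origin, where o is the camera center. A minimal sketch, assuming a pinhole camera and a world-from-camera convention (the function name and conventions here are illustrative, not the paper's implementation):

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """Per-pixel 6D Plücker ray embedding (d, o x d) for a pinhole camera.

    K: (3, 3) intrinsics; R: (3, 3) world-from-camera rotation;
    t: (3,) camera center in world coordinates (assumed convention).
    Returns an (H, W, 6) array: normalized ray direction plus ray moment.
    """
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Homogeneous pixel coordinates at pixel centers.
    pix = np.stack([xs + 0.5, ys + 0.5, np.ones_like(xs, dtype=float)], axis=-1)
    dirs_cam = pix @ np.linalg.inv(K).T            # back-project pixels to camera rays
    dirs = dirs_cam @ R.T                          # rotate rays into the world frame
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    moment = np.cross(np.broadcast_to(t, dirs.shape), dirs)  # o x d
    return np.concatenate([dirs, moment], axis=-1)           # (H, W, 6)

# Toy camera: 64x64 image, identity pose at the world origin.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
emb = plucker_embedding(K, np.eye(3), np.zeros(3), 64, 64)
print(emb.shape)  # (64, 64, 6)
```

Unlike raw extrinsic matrices, this per-pixel ray parameterization changes smoothly with viewpoint, which is why Plücker embeddings are a common choice for injecting camera control into video diffusion models.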
Results
Multi-Human Generation
Camera Control Generation
Background Composition Generation
Single Human Generation
Face Generation
Upper-body Generation
Ethics
Citation
Ruizhi Shao, Yinghao Xu, Yujun Shen, Ceyuan Yang, Yang Zheng, Changan Chen, Yebin Liu, Gordon Wetzstein. "ISA4D: Interspatial Attention for Efficient 4D Human Video Generation".
ACM Trans. Graph. (Proc. SIGGRAPH) 2025.
@article{shao2024isa4d,
title={ISA4D: Interspatial Attention for Efficient 4D Human Video Generation},
author={Shao, Ruizhi and Xu, Yinghao and Shen, Yujun and Yang, Ceyuan and Zheng, Yang and Chen, Changan and Liu, Yebin and Wetzstein, Gordon},
journal={ACM Transactions on Graphics (TOG)},
year={2025},
publisher={ACM New York, NY, USA}
}