UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
Jiehui Huang1,†
Yuechen Zhang2
Xu He3
Yuan Gao4
Zhi Cen4
Bin Xia2
Yan Zhou4
Xin Tao4
Pengfei Wan4
Jiaya Jia1 ✉
1HKUST
2CUHK
3Tsinghua University
4Kling Team, Kuaishou Technology
Overall Method
Figure: Overview of the UnityVideo Framework
✨ Method Showcases
JointGen - Text to Video
T2A 1 - RGB
T2A 1 - Skeleton
T2A 2 - RGB
T2A 2 - Segmentation
T2A 3 - RGB
T2A 3 - Segmentation
T2A 4 - RGB
T2A 4 - RAFT
Estimator - Video to Modality
V2F 1 - RGB
V2F 1 - Skeleton
V2F 2 - RGB
V2F 2 - Skeleton
V2F 4 - RGB
V2F 4 - RAFT
V2F 3 - Depth
V2F 3 - DensePose
ControGen - Modality to Video
F2V 1 - Depth
F2V 1 - RGB
F2V 2 - RAFT
F2V 2 - RGB
🔍 Baseline Comparisons
Case 0 - Wan
Case 0 - UnityVideo
Case 1 - Hunyuan
Case 1 - UnityVideo
Case 2 - Hunyuan
Case 2 - UnityVideo
Case 3 - Hunyuan
Case 3 - UnityVideo
Case 4 - Hunyuan
Case 4 - UnityVideo
Case 5 - Hunyuan
Case 5 - UnityVideo
Case 6 - Hunyuan
Case 6 - UnityVideo
Case 7 - Hunyuan
Case 7 - UnityVideo
Case 8 - UnityVideo
Case 8 - VACE
Case 9 - UnityVideo
Case 9 - VACE