VADER: Video Diffusion Alignment via Reward Gradient
Mihir Prabhudesai*
Zheyang Qin*
Russell Mendonca*
Katerina Fragkiadaki
Deepak Pathak
Carnegie Mellon University
Abstract
We have made significant progress towards building foundational video diffusion models. Because these models are trained on large-scale unsupervised data, it has become crucial to adapt them to specific downstream tasks, such as video-text alignment or ethical video generation. Adapting these models via supervised fine-tuning requires collecting target datasets of videos, which is challenging and tedious. In this work, we instead use pre-trained reward models that are learned from preferences on top of powerful discriminative models. These models provide dense gradient information with respect to the generated RGB pixels, which is critical for learning efficiently in complex search spaces such as videos. We show that our approach can align video diffusion models for aesthetic generation, for similarity between text context and video, and for long-horizon video generation that is 3x longer than the training sequence length. We also show that our approach learns much more efficiently, in terms of both reward queries and compute, than previous gradient-free approaches for video generation.
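To make the contrast between reward gradients and gradient-free search concrete, here is a minimal toy sketch in pure Python. It is not the VADER implementation: the `tanh` "generator", the quadratic reward, and every name and hyperparameter (`DIM`, `TARGET`, `STEPS`, `lr`, `sigma`) are assumptions chosen only for illustration. It gives both methods a similar budget of reward queries and compares the final reward.

```python
import math
import random

DIM = 64       # number of toy "pixels" (assumption)
TARGET = 0.5   # pixel value the toy reward prefers (assumption)
STEPS = 50     # update steps allowed per method (assumption)


def generate(theta):
    """Toy differentiable generator: parameters -> pixels in (-1, 1)."""
    return [math.tanh(t) for t in theta]


def reward(pixels):
    """Toy differentiable reward: higher when pixels are close to TARGET."""
    return -sum((p - TARGET) ** 2 for p in pixels)


def reward_grad_wrt_pixels(pixels):
    """Dense gradient of the toy reward with respect to every pixel."""
    return [-2.0 * (p - TARGET) for p in pixels]


def align_with_reward_gradient(lr=0.1):
    """Gradient ascent: backpropagate the pixel gradient into the parameters."""
    theta = [0.0] * DIM
    for _ in range(STEPS):
        pixels = generate(theta)
        g_pix = reward_grad_wrt_pixels(pixels)
        # Chain rule through tanh: d tanh(t) / dt = 1 - tanh(t)^2.
        theta = [t + lr * g * (1.0 - p * p)
                 for t, g, p in zip(theta, g_pix, pixels)]
    return reward(generate(theta))


def align_gradient_free(sigma=0.05, seed=0):
    """Random-search hill climbing: only scalar reward values are used."""
    rng = random.Random(seed)
    theta = [0.0] * DIM
    best = reward(generate(theta))  # one extra query for the starting point
    for _ in range(STEPS):
        candidate = [t + rng.gauss(0.0, sigma) for t in theta]
        r = reward(generate(candidate))
        if r > best:  # keep the perturbation only if the reward improved
            theta, best = candidate, r
    return best


grad_reward = align_with_reward_gradient()
search_reward = align_gradient_free()
print("reward gradient:", grad_reward)
print("gradient-free  :", search_reward)
```

In this toy setting, each gradient step receives a per-pixel direction of improvement, whereas the random search receives a single scalar per query and must guess the direction. The gap is expected to widen as the number of pixels grows, which is consistent with the abstract's point about complex search spaces such as videos.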
Aesthetic and HPS Reward
A rabbit playing an upright piano in a quaint cafe while patrons enjoy their coffee.
A person picking fresh vegetables in a garden with morning dew glistening.
A teacher writing on a chalkboard in a classroom, explaining a lesson to students.
A person writing poetry in a cozy room with a fireplace crackling nearby.
PickScore Reward
A peaceful deer eating grass in a thick forest, with sunlight filtering through the trees.
A strong lion and a graceful lioness resting together in the shade of a big tree on a wide grassland.
HPS Reward
Object Removal Reward
Removing books with the YOLOS object detection model.
V-JEPA Reward
Improve temporal consistency for Stable Video Diffusion, an image-to-video model.
Aesthetic and ViCLIP Reward
Improve text-video alignment for VideoCrafter2.
A rabbit hitting the drums with its paws in a garden with blooming flowers.
A dog playing a piano with its paws in a living room illuminated by fairy lights.