Please leave us a star ⭐ if you find this work helpful.
- [2025/11] 🔥🔥 We release Qwen-Image, Wan2.1, and FLUX.1-dev Full/LoRA training code.
- [2025/11] 🔥🔥 Nano Banana Pro, FLUX.2-dev, and Z-Image are added to all 🏆 Leaderboards.
- [2025/10] 🔥 Alibaba Group demonstrates the effectiveness of Pref-GRPO for aligning LLMs in *Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning*. Thanks to all contributors!
- [2025/9] 🔥 Seedream-4.0, GPT-4o, Imagen-4-Ultra, Nano Banana, Lumina-DiMOO, OneCAT, Echo-4o, OmniGen2, and Infinity are added to all 🏆 Leaderboards.
- [2025/8] 🔥 We release the 🏆 Leaderboard (English), 🏆 Leaderboard (English Long), 🏆 Leaderboard (Chinese Long), and 🏆 Leaderboard (Chinese).
- Clone this repository and navigate to the folder:

```bash
git clone https://github.com/CodeGoat24/UnifiedReward.git
cd UnifiedReward/Pref-GRPO
```

- Install the training package:

```bash
conda create -n PrefGRPO python=3.12
conda activate PrefGRPO
bash env_setup.sh fastvideo
git clone https://github.com/mlfoundations/open_clip
cd open_clip
pip install -e .
cd ..
mkdir images
```

- Download models:

```bash
huggingface-cli download CodeGoat24/UnifiedReward-2.0-qwen3vl-8b
huggingface-cli download CodeGoat24/UnifiedReward-Think-qwen-7b
wget https://huggingface.co/apple/DFN5B-CLIP-ViT-H-14-378/resolve/main/open_clip_pytorch_model.bin
```

- Install vLLM (the version specifier is quoted so the shell does not treat `>` as a redirect):

```bash
pip install "vllm>=0.11.0"
pip install qwen-vl-utils==0.0.14
```

- Start the server:
```bash
bash vllm_utils/vllm_server_UnifiedReward_Think.sh
```

We use the training prompts from UniGenBench, as listed in `./data/unigenbench_train_data.txt`.
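By default, vLLM serves an OpenAI-compatible HTTP API. As a minimal sketch of how a client could assemble a pairwise-preference query for the reward model launched above (the question wording, default port, and helper name below are illustrative assumptions, not taken from this repository):

```python
def build_pairwise_request(prompt, image_a_b64, image_b_b64,
                           model="CodeGoat24/UnifiedReward-Think-qwen-7b"):
    """Assemble an OpenAI-compatible chat payload asking the reward model
    which of two generated images better matches the text prompt.
    NOTE: the question template here is hypothetical; the repo's server
    script defines the real prompt format."""
    content = [
        {"type": "text",
         "text": (f"Prompt: {prompt}\n"
                  "Which image matches the prompt better, Image 1 or Image 2?")},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_a_b64}"}},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_b_b64}"}},
    ]
    return {"model": model, "messages": [{"role": "user", "content": content}]}

payload = build_pairwise_request("a red cube on a blue table", "...", "...")
# The payload would then be POSTed to the server, e.g.:
# requests.post("http://localhost:8000/v1/chat/completions", json=payload)
```

Consult `vllm_utils/vllm_server_UnifiedReward_Think.sh` for the actual endpoint, port, and prompt template used in training.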
```bash
# FLUX.1-dev
bash fastvideo/data_preprocess/preprocess_flux_rl_embeddings.sh
# Qwen-Image
pip install diffusers==0.35.0 peft==0.17.0 transformers==4.56.0
bash fastvideo/data_preprocess/preprocess_qwen_image_rl_embeddings.sh
# Wan2.1
bash fastvideo/data_preprocess/preprocess_wan_2_1_rl_embeddings.sh
```

- Training:

```bash
# FLUX.1-dev
## UnifiedReward-Think for Pref-GRPO
bash scripts/full_train/finetune_prefgrpo_flux.sh
## UnifiedReward for Point Score-based GRPO
bash scripts/full_train/finetune_unifiedreward_flux.sh
# Qwen-Image
## UnifiedReward-Think for Pref-GRPO
bash scripts/full_train/finetune_prefgrpo_qwenimage.sh
## UnifiedReward for Point Score-based GRPO
bash scripts/full_train/finetune_unifiedreward_qwenimage.sh
# Wan2.1
## Pref-GRPO
bash scripts/full_train/finetune_prefgrpo_wan_2_1.sh
```

- Inference: we use the test prompts from UniGenBench, as listed in `./data/unigenbench_test_data.csv`.
```bash
# FLUX.1-dev
bash inference/flux_dist_infer.sh
# Qwen-Image
bash inference/qwen_image_dist_infer.sh
# Wan2.1
bash inference/wan_dist_infer.sh
```

Then, evaluate the outputs following UniGenBench.
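As background on what the Pref-GRPO training scripts optimize: instead of a pointwise score, each image in a sampled group is rewarded by its pairwise win rate against the other images in the group, and rewards are then group-normalized as in standard GRPO. A minimal pure-Python sketch of that idea (the comparison scheme and normalization details in the actual training code may differ):

```python
from itertools import combinations
from statistics import mean, pstdev

def winrate_rewards(pref):
    """pref[(i, j)] = 1 if image i is preferred over image j, else 0.
    Each image's reward is its win rate against the rest of the group."""
    n = max(max(pair) for pair in pref) + 1
    wins = [0] * n
    for (i, j), i_wins in pref.items():
        wins[i] += i_wins
        wins[j] += 1 - i_wins
    return [w / (n - 1) for w in wins]

def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO group normalization: (r - mean) / std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 3 images; image 0 beats both others, image 1 beats image 2.
pref = {pair: 1 for pair in combinations(range(3), 2)}
rewards = winrate_rewards(pref)   # -> [1.0, 0.5, 0.0]
advs = grpo_advantages(rewards)   # zero-mean advantages for the group
```

Because the win rate is bounded in [0, 1] and defined relative to the group, this pairwise reward avoids the score-scale instabilities that pointwise reward models can introduce, which is the motivation stated in the paper's title.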
If you have any comments or questions, please open a new issue or contact Yibin Wang directly.
Our training code is based on DanceGRPO, Flow-GRPO, and FastVideo.
We also use UniGenBench for T2I model semantic consistency evaluation.
Thanks to all the contributors!
```bibtex
@article{Pref-GRPO-UniGenBench,
  title={Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Zhou, Yujie and Bu, Jiazi and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2508.20751},
  year={2025}
}
```
