2025-07-28
- Added support for FLUX.
- Added support for CLIPScore as a reward model.
- Introduced `config.sample.same_latent` to control whether the same noise is reused for identical prompts, addressing Issue #7.
2025-05-15
- 🔥We showcase example images from the three tasks, along with how they evolve during training, at https://gongyeliu.github.io/Flow-GRPO. Check them out!
- 🔥We now provide an online demo for all three tasks at https://huggingface.co/spaces/jieliu/SD3.5-M-Flow-GRPO. You're welcome to try it out!
| Task | Model |
|---|---|
| GenEval | 🤗GenEval |
| Text Rendering | 🤗Text |
| Human Preference Alignment | 🤗PickScore |
Clone this repository and install the package:
```bash
git clone https://github.com/yifan123/flow_grpo.git
cd flow_grpo
conda create -n flow_grpo python=3.10.16
conda activate flow_grpo
pip install -e .
```
To avoid redundant downloads and wasted storage during multi-GPU training, please pre-download the required models in advance (a pre-download sketch follows the lists below).
Models
- SD3.5: `stabilityai/stable-diffusion-3.5-medium`
- FLUX: `black-forest-labs/FLUX.1-dev`

Reward Models
- PickScore: `laion/CLIP-ViT-H-14-laion2B-s32B-b79K` and `yuvalkirstain/PickScore_v1`
- CLIPScore: `openai/clip-vit-large-patch14`
- Aesthetic Score: `openai/clip-vit-large-patch14`
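As one possible way to pre-download, the models above can be fetched ahead of time with `huggingface_hub` (a minimal sketch; `huggingface-cli download` works equally well, and gated repos such as FLUX.1-dev require `huggingface-cli login` first):

```python
# Sketch: pre-download all required models into the local Hugging Face cache,
# so multi-GPU training does not trigger concurrent downloads.
from huggingface_hub import snapshot_download

repos = [
    "stabilityai/stable-diffusion-3.5-medium",
    "black-forest-labs/FLUX.1-dev",
    "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
    "yuvalkirstain/PickScore_v1",
    "openai/clip-vit-large-patch14",
]
for repo_id in repos:
    snapshot_download(repo_id=repo_id)
```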
The steps above only install the current repository. Since each reward model may rely on different dependency versions, combining them all in one Conda environment can cause version conflicts. To avoid this, we adopt a remote-server setup inspired by ddpo-pytorch: you only need to install the specific reward model(s) you plan to use.
Please create a new Conda virtual environment and install the corresponding dependencies according to the instructions in reward-server.
Please install PaddleOCR:
```bash
pip install paddlepaddle-gpu==2.6.2
pip install paddleocr==2.9.1
pip install python-Levenshtein
```
Then pre-download the OCR model from the Python command line:
```python
from paddleocr import PaddleOCR

# Instantiating PaddleOCR once downloads the detection/recognition weights.
ocr = PaddleOCR(use_angle_cls=False, lang="en", use_gpu=False, show_log=False)
```
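For intuition only, here is a hypothetical sketch of how an OCR-based reward could score rendered text by combining PaddleOCR output with a normalized Levenshtein similarity. The function name, image path, and target text are placeholders; the actual reward used by this repo lives in the reward-server code and may differ.

```python
import Levenshtein
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=False, lang="en", use_gpu=False, show_log=False)

def ocr_reward(image_path: str, target_text: str) -> float:
    """Hypothetical reward: 1.0 when the OCR'd text matches the target exactly."""
    result = ocr.ocr(image_path, cls=False)
    lines = result[0] or []  # detections for the first (and only) image
    recognized = " ".join(text for _box, (text, _conf) in lines)
    dist = Levenshtein.distance(recognized.lower(), target_text.lower())
    return max(0.0, 1.0 - dist / max(len(target_text), 1))
```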
PickScore requires no additional installation.
Since `sglang` may conflict with other environments, we recommend creating a new conda environment:
```bash
conda create -n sglang python=3.10.16
conda activate sglang
pip install "sglang[all]"
```
We use sglang to deploy the reward service. After installing sglang, please run the following command to launch UnifiedReward:
```bash
python -m sglang.launch_server --model-path CodeGoat24/UnifiedReward-7b-v1.5 --api-key flowgrpo --port 17140 --chat-template chatml-llava --enable-p2p-check --mem-fraction-static 0.85
```
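As a quick sanity check (a sketch, not part of the repo), the launched server exposes sglang's OpenAI-compatible API, so you can query it with the `openai` client using the same port and API key as above:

```python
# Sketch: verify the UnifiedReward server is reachable.
# Assumes the server from the command above is running on localhost:17140.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:17140/v1", api_key="flowgrpo")
response = client.chat.completions.create(
    # The served model name defaults to the --model-path value.
    model="CodeGoat24/UnifiedReward-7b-v1.5",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=16,
)
print(response.choices[0].message.content)
```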
Please install ImageReward:
```bash
pip install image-reward
pip install git+https://github.com/openai/CLIP.git
```
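To sanity-check the installation, the `image-reward` package can score a prompt/image pair directly (a minimal sketch based on the upstream package; the prompt and image path are placeholders):

```python
# Sketch: score an image against its prompt with ImageReward.
import ImageReward as RM

model = RM.load("ImageReward-v1.0")  # downloads the checkpoint on first use
score = model.score("a photo of a corgi wearing sunglasses", ["example.png"])
print(score)
```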
Single-node training:
```bash
# sd3
bash scripts/single_node/grpo.sh
# flux
bash scripts/single_node/grpo_flux.sh
```
Multi-node training for SD3:
```bash
# Master node
bash scripts/multi_node/sd3/main.sh
# Other nodes
bash scripts/multi_node/sd3/main1.sh
bash scripts/multi_node/sd3/main2.sh
bash scripts/multi_node/sd3/main3.sh
```
Multi-node training for Flux:
```bash
# Master node
bash scripts/multi_node/flux/main.sh
# Other nodes
bash scripts/multi_node/flux/main1.sh
bash scripts/multi_node/flux/main2.sh
bash scripts/multi_node/flux/main3.sh
```
Single-node training:
```bash
bash scripts/single_node/dpo.sh
bash scripts/single_node/sft.sh
```
Multi-node training:
Please update the entry Python script and the config file names in the `scripts/multi_node` bash scripts.
To integrate a new model into this framework, please follow the steps below:
1. Add the following files adapted for your model:
   - `flow_grpo/diffusers_patch/sd3_pipeline_with_logprob.py`: adapted from `pipeline_stable_diffusion_3.py`; refer to the corresponding diffusers pipeline for your model.
   - `scripts/train_sd3.py`: based on `train_dreambooth_lora_sd3.py` from the diffusers DreamBooth examples.
   - `flow_grpo/diffusers_patch/sd3_sde_with_logprob.py`: handles SDE sampling. In most cases you don't need to modify it; however, if your definitions of `dt` or `velocity` differ in sign or convention, please adjust accordingly.
2. Verify SDE sampling:
   Set `noise_level = 0` in `sde_demo.py` and check whether the generated images look normal. This helps verify that your SDE implementation is correct.
3. Ensure on-policy consistency:
   Set `config.sample.num_batches_per_epoch = 1` and `config.train.gradient_accumulation_steps = 1` to enforce a purely on-policy setup, where the model collecting samples is identical to the one being trained. Under this setting, the importance ratio should remain exactly 1 (a minimal check is sketched after this list). If it is not, check whether the sampling and training code paths differ, for example through the use of `torch.compile` or other model wrappers, and make sure both share the same logic.
4. Tune reward behavior:
   Start with `config.train.beta = 0` (no KL regularization) to observe whether the reward increases during training. You may also need to adjust the noise level here based on your model. Other hyperparameters are generally model-agnostic and can be kept at their defaults.
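For step 3, the following is a minimal sketch of the kind of check you can add during training. The tensor names `new_log_probs` and `old_log_probs` are placeholders for whatever your training loop computes for the current policy and the sampling-time policy; they are not variables defined by this repo.

```python
import torch

def check_on_policy_ratio(new_log_probs: torch.Tensor,
                          old_log_probs: torch.Tensor,
                          atol: float = 1e-4) -> None:
    """In a purely on-policy setup, exp(new_log_probs - old_log_probs) must be 1."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    if not torch.allclose(ratio, torch.ones_like(ratio), atol=atol):
        max_dev = (ratio - 1).abs().max().item()
        raise RuntimeError(
            f"Importance ratio deviates from 1 by up to {max_dev:.3e}; "
            "the sampling and training code paths are likely inconsistent."
        )
```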
For multi-reward settings, you can pass in a dictionary where each key is a reward name and the corresponding value is its weight. For example:
```python
{
    "pickscore": 0.5,
    "ocr": 0.2,
    "aesthetic": 0.3
}
```
This means the final reward is a weighted sum of the individual rewards.
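Conceptually, the combination amounts to the following (a sketch of the weighted sum, not the repo's exact implementation; the values in `rewards` are made-up example scores for one sample):

```python
# Sketch: combine per-model rewards into a single scalar.
reward_weights = {"pickscore": 0.5, "ocr": 0.2, "aesthetic": 0.3}
rewards = {"pickscore": 0.91, "ocr": 1.0, "aesthetic": 0.62}  # example scores

final_reward = sum(w * rewards[name] for name, w in reward_weights.items())
print(final_reward)  # 0.5*0.91 + 0.2*1.0 + 0.3*0.62 = 0.841
```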
The following reward models are currently supported:
- GenEval evaluates T2I models on complex compositional prompts.
- OCR provides an OCR-based reward.
- PickScore is a general-purpose T2I reward model trained on human preferences.
- DeQA is a multimodal LLM-based image quality assessment model that measures the impact of distortions and texture damage on perceived quality.
- ImageReward is a general-purpose T2I reward model capturing text-image alignment, visual fidelity, and safety.
- QwenVL is an experimental reward model using prompt engineering.
- Aesthetic is a CLIP-based linear regressor predicting image aesthetic scores.
- JPEG_Compressibility measures image size as a proxy for quality.
- UnifiedReward is a state-of-the-art reward model for multimodal understanding and generation, topping the human preference leaderboard.
You can adjust the parameters in `config/grpo.py` to tune different hyperparameters. An empirical finding is that `config.sample.train_batch_size * num_gpu / config.sample.num_image_per_prompt * config.sample.num_batches_per_epoch = 48` (i.e., a group number of 48) together with `config.sample.num_image_per_prompt = 24` (i.e., a group size of 24) works well. Additionally, `config.train.gradient_accumulation_steps = config.sample.num_batches_per_epoch // 2` also works well (see the sketch below).
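To make these relationships concrete, here is a small sketch that reports the quantities above from a config object (attribute names follow the config keys mentioned above; `num_gpu` is whatever your launcher reports, e.g. via `accelerate`):

```python
# Sketch: sanity-check the empirical group settings described above.
def check_group_settings(config, num_gpu: int) -> None:
    group_size = config.sample.num_image_per_prompt
    group_number = (config.sample.train_batch_size * num_gpu
                    / config.sample.num_image_per_prompt
                    * config.sample.num_batches_per_epoch)
    print(f"group_size   = {group_size}   (suggested: 24)")
    print(f"group_number = {group_number} (suggested: 48)")

    # Half of the epoch's sampling batches per optimizer step, as suggested above.
    suggested_accum = config.sample.num_batches_per_epoch // 2
    print(f"suggested gradient_accumulation_steps = {suggested_accum}")
```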
This repo is based on ddpo-pytorch and diffusers. We thank the authors for their valuable contributions to the AIGC community. Special thanks to Kevin Black for the excellent ddpo-pytorch repo.
```bibtex
@article{liu2025flow,
  title={Flow-GRPO: Training Flow Matching Models via Online RL},
  author={Liu, Jie and Liu, Gongye and Liang, Jiajun and Li, Yangguang and Liu, Jiaheng and Wang, Xintao and Wan, Pengfei and Zhang, Di and Ouyang, Wanli},
  journal={arXiv preprint arXiv:2505.05470},
  year={2025}
}
```