2025-07-28
- Added support for FLUX.
- Added support for CLIPScore as a reward model.
- Introduced `config.sample.same_latent` to control whether the same noise is reused for identical prompts, addressing Issue #7.
2025-05-15
- 🔥We showcase example images from the three tasks, along with how they evolve during training, at https://gongyeliu.github.io/Flow-GRPO. Check them out!
- 🔥We now provide an online demo for all three tasks at https://huggingface.co/spaces/jieliu/SD3.5-M-Flow-GRPO. You're welcome to try it out!
| Task | Model |
|---|---|
| GenEval | 🤗GenEval |
| Text Rendering | 🤗Text |
| Human Preference Alignment | 🤗PickScore |
Clone this repository and install the package:
```bash
git clone https://github.com/yifan123/flow_grpo.git
cd flow_grpo
conda create -n flow_grpo python=3.10.16
conda activate flow_grpo
pip install -e .
```
To avoid redundant downloads and wasted storage during multi-GPU training, please pre-download the required models in advance (a pre-download sketch follows the lists below).
Models
- SD3.5: `stabilityai/stable-diffusion-3.5-medium`
- FLUX: `black-forest-labs/FLUX.1-dev`

Reward Models
- PickScore: `laion/CLIP-ViT-H-14-laion2B-s32B-b79K` and `yuvalkirstain/PickScore_v1`
- CLIPScore: `openai/clip-vit-large-patch14`
- Aesthetic Score: `openai/clip-vit-large-patch14`
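As one possible way to pre-download, the models above can be fetched ahead of time with `huggingface_hub` (a minimal sketch; `huggingface-cli download` works equally well, and gated repos such as FLUX.1-dev require `huggingface-cli login` first):

```python
# Sketch: pre-download all required models into the local Hugging Face cache,
# so multi-GPU training does not trigger concurrent downloads.
from huggingface_hub import snapshot_download

repos = [
    "stabilityai/stable-diffusion-3.5-medium",
    "black-forest-labs/FLUX.1-dev",
    "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
    "yuvalkirstain/PickScore_v1",
    "openai/clip-vit-large-patch14",
]
for repo_id in repos:
    snapshot_download(repo_id=repo_id)
```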
The steps above only install the current repository. Since each reward model may rely on different dependency versions, combining them all in one Conda environment can cause version conflicts. To avoid this, we adopt a remote-server setup inspired by ddpo-pytorch: you only need to install the specific reward model(s) you plan to use.
Please create a new Conda virtual environment and install the corresponding dependencies according to the instructions in reward-server.
Please install PaddleOCR:
```bash
pip install paddlepaddle-gpu==2.6.2
pip install paddleocr==2.9.1
pip install python-Levenshtein
```
Then pre-download the OCR model from the Python command line:
```python
from paddleocr import PaddleOCR

# Instantiating PaddleOCR once downloads the detection/recognition weights.
ocr = PaddleOCR(use_angle_cls=False, lang="en", use_gpu=False, show_log=False)
```
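For intuition only, here is a hypothetical sketch of how an OCR-based reward could score rendered text by combining PaddleOCR output with a normalized Levenshtein similarity. The function name, image path, and target text are placeholders; the actual reward used by this repo lives in the reward-server code and may differ.

```python
import Levenshtein
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=False, lang="en", use_gpu=False, show_log=False)

def ocr_reward(image_path: str, target_text: str) -> float:
    """Hypothetical reward: 1.0 when the OCR'd text matches the target exactly."""
    result = ocr.ocr(image_path, cls=False)
    lines = result[0] or []  # detections for the first (and only) image
    recognized = " ".join(text for _box, (text, _conf) in lines)
    dist = Levenshtein.distance(recognized.lower(), target_text.lower())
    return max(0.0, 1.0 - dist / max(len(target_text), 1))
```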
PickScore requires no additional installation.
Since `sglang` may conflict with other environments, we recommend creating a new conda environment:
```bash
conda create -n sglang python=3.10.16
conda activate sglang
pip install "sglang[all]"
```
We use sglang to deploy the reward service. After installing sglang, please run the following command to launch UnifiedReward:
```bash
python -m sglang.launch_server --model-path CodeGoat24/UnifiedReward-7b-v1.5 --api-key flowgrpo --port 17140 --chat-template chatml-llava --enable-p2p-check --mem-fraction-static 0.85
```
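As a quick sanity check (a sketch, not part of the repo), the launched server exposes sglang's OpenAI-compatible API, so you can query it with the `openai` client using the same port and API key as above:

```python
# Sketch: verify the UnifiedReward server is reachable.
# Assumes the server from the command above is running on localhost:17140.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:17140/v1", api_key="flowgrpo")
response = client.chat.completions.create(
    # The served model name defaults to the --model-path value.
    model="CodeGoat24/UnifiedReward-7b-v1.5",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=16,
)
print(response.choices[0].message.content)
```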
Please install ImageReward:
```bash
pip install image-reward
pip install git+https://github.com/openai/CLIP.git
```
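To sanity-check the installation, the `image-reward` package can score a prompt/image pair directly (a minimal sketch based on the upstream package; the prompt and image path are placeholders):

```python
# Sketch: score an image against its prompt with ImageReward.
import ImageReward as RM

model = RM.load("ImageReward-v1.0")  # downloads the checkpoint on first use
score = model.score("a photo of a corgi wearing sunglasses", ["example.png"])
print(score)
```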
Single-node training:
```bash
# sd3
bash scripts/single_node/grpo.sh
# flux
bash scripts/single_node/grpo_flux.sh
```
Multi-node training for SD3:
```bash
# Master node
bash scripts/multi_node/sd3/main.sh
# Other nodes
bash scripts/multi_node/sd3/main1.sh
bash scripts/multi_node/sd3/main2.sh
bash scripts/multi_node/sd3/main3.sh
```
Multi-node training for Flux:
```bash
# Master node
bash scripts/multi_node/flux/main.sh
# Other nodes
bash scripts/multi_node/flux/main1.sh
bash scripts/multi_node/flux/main2.sh
bash scripts/multi_node/flux/main3.sh
```
Single-node training:
```bash
bash scripts/single_node/dpo.sh
bash scripts/single_node/sft.sh
```
Multi-node training:
Please update the entry Python script and the config file names in the `scripts/multi_node` bash scripts.
To integrate a new model into this framework, please follow the steps below:
1. Add the following files adapted for your model:
   - `flow_grpo/diffusers_patch/sd3_pipeline_with_logprob.py`: adapted from `pipeline_stable_diffusion_3.py`; refer to the corresponding diffusers pipeline for your model.
   - `scripts/train_sd3.py`: based on `train_dreambooth_lora_sd3.py` from the diffusers DreamBooth examples.
   - `flow_grpo/diffusers_patch/sd3_sde_with_logprob.py`: handles SDE sampling. In most cases you don't need to modify it; however, if your definitions of `dt` or `velocity` differ in sign or convention, please adjust accordingly.
2. Verify SDE sampling:
   Set `noise_level = 0` in `sde_demo.py` and check whether the generated images look normal. This helps verify that your SDE implementation is correct.
3. Ensure on-policy consistency:
   Set `config.sample.num_batches_per_epoch = 1` and `config.train.gradient_accumulation_steps = 1` to enforce a purely on-policy setup, where the model collecting samples is identical to the one being trained. Under this setting, the importance ratio should remain exactly 1 (a minimal check is sketched after this list). If it is not, check whether the sampling and training code paths differ, for example through the use of `torch.compile` or other model wrappers, and make sure both share the same logic.
4. Tune reward behavior:
   Start with `config.train.beta = 0` (no KL regularization) to observe whether the reward increases during training. You may also need to adjust the noise level here based on your model. Other hyperparameters are generally model-agnostic and can be kept at their defaults.
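For step 3, the following is a minimal sketch of the kind of check you can add during training. The tensor names `new_log_probs` and `old_log_probs` are placeholders for whatever your training loop computes for the current policy and the sampling-time policy; they are not variables defined by this repo.

```python
import torch

def check_on_policy_ratio(new_log_probs: torch.Tensor,
                          old_log_probs: torch.Tensor,
                          atol: float = 1e-4) -> None:
    """In a purely on-policy setup, exp(new_log_probs - old_log_probs) must be 1."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    if not torch.allclose(ratio, torch.ones_like(ratio), atol=atol):
        max_dev = (ratio - 1).abs().max().item()
        raise RuntimeError(
            f"Importance ratio deviates from 1 by up to {max_dev:.3e}; "
            "the sampling and training code paths are likely inconsistent."
        )
```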
For multi-reward settings, you can pass in a dictionary where each key is a reward name and the corresponding value is its weight. For example:
```python
{
    "pickscore": 0.5,
    "ocr": 0.2,
    "aesthetic": 0.3
}
```
This means the final reward is a weighted sum of the individual rewards.
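Conceptually, the combination amounts to the following (a sketch of the weighted sum, not the repo's exact implementation; the values in `rewards` are made-up example scores for one sample):

```python
# Sketch: combine per-model rewards into a single scalar.
reward_weights = {"pickscore": 0.5, "ocr": 0.2, "aesthetic": 0.3}
rewards = {"pickscore": 0.91, "ocr": 1.0, "aesthetic": 0.62}  # example scores

final_reward = sum(w * rewards[name] for name, w in reward_weights.items())
print(final_reward)  # 0.5*0.91 + 0.2*1.0 + 0.3*0.62 = 0.841
```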
The following reward models are currently supported:
- GenEval evaluates T2I models on complex compositional prompts.
- OCR provides an OCR-based reward.
- PickScore is a general-purpose T2I reward model trained on human preferences.
- DeQA is a multimodal LLM-based image quality assessment model that measures the impact of distortions and texture damage on perceived quality.
- ImageReward is a general-purpose T2I reward model capturing text-image alignment, visual fidelity, and safety.
- QwenVL is an experimental reward model using prompt engineering.
- Aesthetic is a CLIP-based linear regressor predicting image aesthetic scores.
- JPEG_Compressibility measures image size as a proxy for quality.
- UnifiedReward is a state-of-the-art reward model for multimodal understanding and generation, topping the human preference leaderboard.
You can adjust the parameters in `config/grpo.py` to tune different hyperparameters. An empirical finding is that `config.sample.train_batch_size * num_gpu / config.sample.num_image_per_prompt * config.sample.num_batches_per_epoch = 48` (i.e., a group number of 48) together with `config.sample.num_image_per_prompt = 24` (i.e., a group size of 24) works well. Additionally, `config.train.gradient_accumulation_steps = config.sample.num_batches_per_epoch // 2` also works well (see the sketch below).
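To make these relationships concrete, here is a small sketch that reports the quantities above from a config object (attribute names follow the config keys mentioned above; `num_gpu` is whatever your launcher reports, e.g. via `accelerate`):

```python
# Sketch: sanity-check the empirical group settings described above.
def check_group_settings(config, num_gpu: int) -> None:
    group_size = config.sample.num_image_per_prompt
    group_number = (config.sample.train_batch_size * num_gpu
                    / config.sample.num_image_per_prompt
                    * config.sample.num_batches_per_epoch)
    print(f"group_size   = {group_size}   (suggested: 24)")
    print(f"group_number = {group_number} (suggested: 48)")

    # Half of the epoch's sampling batches per optimizer step, as suggested above.
    suggested_accum = config.sample.num_batches_per_epoch // 2
    print(f"suggested gradient_accumulation_steps = {suggested_accum}")
```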
This repo is based on ddpo-pytorch and diffusers. We thank the authors for their valuable contributions to the AIGC community. Special thanks to Kevin Black for the excellent ddpo-pytorch repo.
```bibtex
@article{liu2025flow,
  title={Flow-GRPO: Training Flow Matching Models via Online RL},
  author={Liu, Jie and Liu, Gongye and Liang, Jiajun and Li, Yangguang and Liu, Jiaheng and Wang, Xintao and Wan, Pengfei and Zhang, Di and Ouyang, Wanli},
  journal={arXiv preprint arXiv:2505.05470},
  year={2025}
}
```