Diffusion steering via reinforcement learning (DSRL) is a lightweight and efficient method for RL finetuning of diffusion and flow policies. Rather than modifying the weights of the diffusion/flow policy, DSRL instead learns to modify the distribution of the initial noise that seeds the denoising process.
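As a conceptual illustration, the sketch below shows deterministic DDIM sampling where the only thing DSRL changes is how the initial latent `x_T` is chosen: rather than drawing it from a standard Gaussian, a learned noise-space policy selects it. The names here (`eps_model`, `noise_policy`, etc.) are placeholders for illustration and do not correspond to this codebase's API.

```python
import numpy as np

def ddim_sample(eps_model, obs, x_T, alphas_cumprod):
    """Deterministic DDIM denoising starting from a given initial latent x_T.
    `eps_model(obs, x_t, t)` is an assumed noise-prediction network."""
    x_t = x_T
    for t in reversed(range(len(alphas_cumprod))):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else 1.0
        eps = eps_model(obs, x_t, t)
        # Predict the clean action, then take the DDIM (eta = 0) step.
        x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x_t = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x_t

def dsrl_action(eps_model, noise_policy, obs, alphas_cumprod):
    # DSRL: instead of x_T ~ N(0, I), the RL agent chooses the initial latent,
    # while the denoising network itself stays frozen.
    x_T = noise_policy(obs)
    return ddim_sample(eps_model, obs, x_T, alphas_cumprod)
```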
- Clone repository:
```bash
git clone --recurse-submodules git@github.com:ajwagen/dsrl.git
cd dsrl
```
- Create conda environment:
```bash
conda create -n dsrl python=3.9 -y
conda activate dsrl
```
- Install our fork of DPPO:
```bash
cd dppo
pip install -e .
pip install -e .[robomimic]
pip install -e .[gym]
cd ..
```
- Install our fork of Stable Baselines3:
```bash
cd stable-baselines3
pip install -e .
cd ..
```
The diffusion policy checkpoints for the Robomimic and Gym experiments can be found here. Download the contents of this folder and place them in `./dppo/log`.
To run DSRL on Robomimic, call
```bash
python train_dsrl.py --config-path=cfg/robomimic --config-name=dsrl_can.yaml
```
replacing `dsrl_can.yaml` with the config file for the desired task. Similarly, for Gym, call
```bash
python train_dsrl.py --config-path=cfg/gym --config-name=dsrl_hopper.yaml
```
replacing `dsrl_hopper.yaml` with the config file for the desired task.
It is straightforward to apply DSRL to new settings. Doing this typically requires:
- Access to a diffusion or flow policy with the ability to control the noise that initializes the denoising process. Note that, if using a diffusion policy, it must be sampled from with DDIM sampling.
- In the case of `DSRL-NA`, the diffusion/flow policy is passed to the `SACDiffusionNoise` algorithm, and this algorithm is then simply run on a standard gym environment.
- In the case of `DSRL-SAC`, it is recommended that you write a wrapper around your environment which transforms the action space from the original action space to the noise space of the diffusion/flow policy. The noise action given to the environment wrapper is denoised through the diffusion policy, and the resulting denoised action is played on the original environment, all within the environment wrapper. See `DiffusionPolicyEnvWrapper` in `env_utils.py` for an example of this, and the sketch after this list.
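As referenced above, a minimal sketch of such a wrapper is given below, assuming a Gymnasium-style environment and a diffusion policy object that exposes a `denoise(obs, noise)` method and a known noise dimension. It is only illustrative and is not the `DiffusionPolicyEnvWrapper` implementation from `env_utils.py`.

```python
import numpy as np
import gymnasium as gym

class NoiseActionWrapper(gym.Wrapper):
    """Illustrative sketch: exposes the diffusion policy's noise space as the
    action space of the wrapped environment."""

    def __init__(self, env, diffusion_policy, noise_dim, action_magnitude=1.5):
        super().__init__(env)
        self.diffusion_policy = diffusion_policy  # assumed to expose denoise(obs, noise)
        # The RL agent acts in the noise space, bounded by action_magnitude.
        self.action_space = gym.spaces.Box(
            low=-action_magnitude, high=action_magnitude,
            shape=(noise_dim,), dtype=np.float32,
        )
        self._last_obs = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        return obs, info

    def step(self, noise_action):
        # Denoise the noise-space action into an action in the original space,
        # then play that action on the underlying environment.
        env_action = self.diffusion_policy.denoise(self._last_obs, noise_action)
        obs, reward, terminated, truncated, info = self.env.step(env_action)
        self._last_obs = obs
        return obs, reward, terminated, truncated, info
```

With such a wrapper in place, `DSRL-SAC` amounts to running standard `SAC` on the wrapped environment.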
The following may be helpful in tuning DSRL on new settings:
- Typically the key hyperparameters to tune are `action_magnitude` and `utd`. `action_magnitude` controls how large a noise value can be played in the noise action space, and `utd` is the number of gradient steps taken per update. Setting `action_magnitude` around 1.5 and `utd` around 20 typically performs effectively, but for best performance these should be tuned on new environments.
- As described in the paper, there are two primary variants of the algorithm: `DSRL-NA` and `DSRL-SAC`. `DSRL-SAC` simply runs `SAC` with the action space set to the noise space of the diffusion policy, while `DSRL-NA` distills a Q-function learned on the original action space (see the paper for further details). In general, `DSRL-NA` is more sample efficient and should be preferred to `DSRL-SAC`; however, `DSRL-SAC` is somewhat more computationally efficient in settings where speed is critical.
- DSRL typically performs best when using relatively large actor and critic networks. A 3-layer MLP of width 2048 is a reasonable default, and tuning the size can sometimes lead to further gains (see the sketch below).
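As an illustration of the last point, under standard Stable Baselines3 conventions the actor/critic width can be set through `policy_kwargs`. The snippet below uses vanilla `SAC` on a placeholder environment rather than this repo's `SACDiffusionNoise` algorithm or config system, whose exact interface may differ.

```python
from stable_baselines3 import SAC

# Generic SB3 example: a 3-layer MLP of width 2048 for both actor and critic.
# How network width and utd are exposed in this repo's configs may differ.
policy_kwargs = dict(net_arch=[2048, 2048, 2048])

model = SAC(
    "MlpPolicy",
    "Pendulum-v1",        # placeholder environment
    policy_kwargs=policy_kwargs,
    gradient_steps=20,     # analogous to utd with the default train_freq of 1
    verbose=1,
)
model.learn(total_timesteps=10_000)
```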
Our implementation of DSRL is built on top of Stable Baselines3. For our diffusion policy implementation, we utilize the implementation given in the DPPO codebase.
```bibtex
@article{wagenmaker2025steering,
  author  = {Wagenmaker, Andrew and Nakamoto, Mitsuhiko and Zhang, Yunchu and Park, Seohong and Yagoub, Waleed and Nagabandi, Anusha and Gupta, Abhishek and Levine, Sergey},
  title   = {Steering Your Diffusion Policy with Latent Space Reinforcement Learning},
  journal = {Conference on Robot Learning (CoRL)},
  year    = {2025},
}
```