Diffusion steering via reinforcement learning (DSRL) is a lightweight and efficient method for RL finetuning of diffusion and flow policies. Rather than modifying the weights of the diffusion/flow policy, DSRL instead learns to modify the distribution of the initial noise that seeds the denoising process.
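As a conceptual illustration, the sketch below shows deterministic DDIM sampling where the only thing DSRL changes is how the initial latent `x_T` is chosen: rather than drawing it from a standard Gaussian, a learned noise-space policy selects it. The names here (`eps_model`, `noise_policy`, etc.) are placeholders for illustration and do not correspond to this codebase's API.

```python
import numpy as np

def ddim_sample(eps_model, obs, x_T, alphas_cumprod):
    """Deterministic DDIM denoising starting from a given initial latent x_T.
    `eps_model(obs, x_t, t)` is an assumed noise-prediction network."""
    x_t = x_T
    for t in reversed(range(len(alphas_cumprod))):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else 1.0
        eps = eps_model(obs, x_t, t)
        # Predict the clean action, then take the DDIM (eta = 0) step.
        x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x_t = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x_t

def dsrl_action(eps_model, noise_policy, obs, alphas_cumprod):
    # DSRL: instead of x_T ~ N(0, I), the RL agent chooses the initial latent,
    # while the denoising network itself stays frozen.
    x_T = noise_policy(obs)
    return ddim_sample(eps_model, obs, x_T, alphas_cumprod)
```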
- Clone repository:
```bash
git clone --recurse-submodules git@github.com:ajwagen/dsrl.git
cd dsrl
```
- Create conda environment:
```bash
conda create -n dsrl python=3.9 -y
conda activate dsrl
```
- Install our fork of DPPO:
```bash
cd dppo
pip install -e .
pip install -e .[robomimic]
pip install -e .[gym]
cd ..
```
- Install our fork of Stable Baselines3:
```bash
cd stable-baselines3
pip install -e .
cd ..
```
The diffusion policy checkpoints for the Robomimic and Gym experiments can be found here. Download the contents of this folder and place them in `./dppo/log`.
To run DSRL on Robomimic, call
```bash
python train_dsrl.py --config-path=cfg/robomimic --config-name=dsrl_can.yaml
```
replacing `dsrl_can.yaml` with the config file for the desired task. Similarly, for Gym, call
```bash
python train_dsrl.py --config-path=cfg/gym --config-name=dsrl_hopper.yaml
```
replacing `dsrl_hopper.yaml` with the config file for the desired task.
It is straightforward to apply DSRL to new settings. Doing this typically requires:
- Access to a diffusion or flow policy with the ability to control the noise that initializes the denoising process. Note that, if using a diffusion policy, it must be sampled from with DDIM sampling.
- In the case of `DSRL-NA`, the diffusion/flow policy is passed to the `SACDiffusionNoise` algorithm, and this algorithm is then simply run on a standard gym environment.
- In the case of `DSRL-SAC`, it is recommended that you write a wrapper around your environment which transforms the action space from the original action space to the noise space of the diffusion/flow policy. The noise action given to the environment wrapper is denoised through the diffusion policy, and the resulting denoised action is played on the original environment, all within the environment wrapper. See `DiffusionPolicyEnvWrapper` in `env_utils.py` for an example of this, and the sketch after this list.
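As referenced above, a minimal sketch of such a wrapper is given below, assuming a Gymnasium-style environment and a diffusion policy object that exposes a `denoise(obs, noise)` method and a known noise dimension. It is only illustrative and is not the `DiffusionPolicyEnvWrapper` implementation from `env_utils.py`.

```python
import numpy as np
import gymnasium as gym

class NoiseActionWrapper(gym.Wrapper):
    """Illustrative sketch: exposes the diffusion policy's noise space as the
    action space of the wrapped environment."""

    def __init__(self, env, diffusion_policy, noise_dim, action_magnitude=1.5):
        super().__init__(env)
        self.diffusion_policy = diffusion_policy  # assumed to expose denoise(obs, noise)
        # The RL agent acts in the noise space, bounded by action_magnitude.
        self.action_space = gym.spaces.Box(
            low=-action_magnitude, high=action_magnitude,
            shape=(noise_dim,), dtype=np.float32,
        )
        self._last_obs = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        return obs, info

    def step(self, noise_action):
        # Denoise the noise-space action into an action in the original space,
        # then play that action on the underlying environment.
        env_action = self.diffusion_policy.denoise(self._last_obs, noise_action)
        obs, reward, terminated, truncated, info = self.env.step(env_action)
        self._last_obs = obs
        return obs, reward, terminated, truncated, info
```

With such a wrapper in place, `DSRL-SAC` amounts to running standard `SAC` on the wrapped environment.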
The following may be helpful in tuning DSRL on new settings:
- Typically the key hyperparameters to tune are `action_magnitude` and `utd`. `action_magnitude` controls how large a noise value can be played in the noise action space, and `utd` is the number of gradient steps taken per update. Setting `action_magnitude` around 1.5 and `utd` around 20 typically performs effectively, but for best performance these should be tuned on new environments.
- As described in the paper, there are two primary variants of the algorithm: `DSRL-NA` and `DSRL-SAC`. `DSRL-SAC` simply runs `SAC` with the action space set to the noise space of the diffusion policy, while `DSRL-NA` distills a Q-function learned on the original action space (see the paper for further details). In general, `DSRL-NA` is more sample efficient and should be preferred to `DSRL-SAC`; however, `DSRL-SAC` is somewhat more computationally efficient in settings where speed is critical.
- DSRL typically performs best when using relatively large actor and critic networks. A 3-layer MLP of width 2048 is a reasonable default, and tuning the size can sometimes lead to further gains (see the sketch below).
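As an illustration of the last point, under standard Stable Baselines3 conventions the actor/critic width can be set through `policy_kwargs`. The snippet below uses vanilla `SAC` on a placeholder environment rather than this repo's `SACDiffusionNoise` algorithm or config system, whose exact interface may differ.

```python
from stable_baselines3 import SAC

# Generic SB3 example: a 3-layer MLP of width 2048 for both actor and critic.
# How network width and utd are exposed in this repo's configs may differ.
policy_kwargs = dict(net_arch=[2048, 2048, 2048])

model = SAC(
    "MlpPolicy",
    "Pendulum-v1",        # placeholder environment
    policy_kwargs=policy_kwargs,
    gradient_steps=20,     # analogous to utd with the default train_freq of 1
    verbose=1,
)
model.learn(total_timesteps=10_000)
```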
Our implementation of DSRL is built on top of Stable Baselines3. For our diffusion policy implementation, we utilize the implementation given in the DPPO codebase.
```bibtex
@article{wagenmaker2025steering,
  author  = {Wagenmaker, Andrew and Nakamoto, Mitsuhiko and Zhang, Yunchu and Park, Seohong and Yagoub, Waleed and Nagabandi, Anusha and Gupta, Abhishek and Levine, Sergey},
  title   = {Steering Your Diffusion Policy with Latent Space Reinforcement Learning},
  journal = {Conference on Robot Learning (CoRL)},
  year    = {2025},
}
```