Logs are stored in the directory given by --log-dir; the training curves can be viewed with tensorboard.
To change the number of sampled actions, pass the --num flag (default: 10). To normalize the offline data, pass the --normalize flag (this is optional).
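For illustration, below is a minimal argparse sketch of these flags. The script structure and the default value of --log-dir are assumptions made for the sketch and may differ from the actual code; only --num's default of 10 and the optional --normalize flag come from the description above.

```python
# Illustrative sketch only: how the flags described above might be declared.
# The default for --log-dir is an assumption; --num's default of 10 and the
# optional --normalize switch follow the description above.
import argparse

parser = argparse.ArgumentParser(description="MCQ training (sketch)")
parser.add_argument("--log-dir", type=str, default="./logs",
                    help="directory where tensorboard logs are written (default assumed)")
parser.add_argument("--num", type=int, default=10,
                    help="number of sampled actions")
parser.add_argument("--normalize", action="store_true",
                    help="normalize the offline data (optional)")
args = parser.parse_args()
```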
Instructions
In the paper and our implementation, we update the critics via:
$\mathcal{L}_{critic} = \lambda \mathbb{E}_{s,a,s^\prime\sim\mathcal{D},a^\prime\sim\pi(\cdot|s^\prime)}[(Q(s,a) - y)^2] + (1-\lambda)\mathbb{E}_{s\sim\mathcal{D},a\sim\pi(\cdot|s)}[(Q(s,a) - y^\prime)^2]$. One can instead update the critic via $\mathcal{L}_{critic} = \mathbb{E}_{s,a,s^\prime\sim\mathcal{D},a^\prime\sim\pi(\cdot|s^\prime)}[(Q(s,a) - y)^2] + \alpha\mathbb{E}_{s\sim\mathcal{D},a\sim\pi(\cdot|s)}[(Q(s,a) - y^\prime)^2]$. This is also reasonable because $\lambda$ should never be set to $0$; the two forms are related by $\alpha = \frac{1-\lambda}{\lambda}$ with $\lambda\in(0,1)$. Note that the hyperparameter scale changes considerably under the $\alpha$-form (e.g., $\lambda = 0.1$ corresponds to $\alpha = 9$, while $\lambda = 0.5$ corresponds to $\alpha = 1$).
Readers are welcome to try the $\alpha$-style update rule.
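For concreteness, here is a small PyTorch-style sketch of the two weighting schemes, assuming the Q-value predictions and the targets $y$, $y^\prime$ are computed elsewhere; it only illustrates the loss weighting, not the full implementation in this repository.

```python
# Illustrative sketch of the two critic-loss weightings discussed above.
# q_data : Q(s, a) on dataset actions,        y       : its target
# q_pi   : Q(s, a_pi) with a_pi ~ pi(.|s),    y_prime : its pseudo target
import torch.nn.functional as F

def critic_loss_lambda(q_data, y, q_pi, y_prime, lam):
    """lam * E[(Q(s,a) - y)^2] + (1 - lam) * E[(Q(s,a_pi) - y')^2], lam in (0, 1)."""
    return lam * F.mse_loss(q_data, y) + (1.0 - lam) * F.mse_loss(q_pi, y_prime)

def critic_loss_alpha(q_data, y, q_pi, y_prime, alpha):
    """Equivalent alpha-style rule with alpha = (1 - lam) / lam."""
    return F.mse_loss(q_data, y) + alpha * F.mse_loss(q_pi, y_prime)
```

Dividing the $\lambda$-weighted loss by $\lambda$ yields exactly the $\alpha$-weighted loss, so the two share the same minimizer but differ in overall gradient scale, which is why the hyperparameter values look so different between the two forms.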
Citation
If you use our method or code in your research, please consider citing the paper as follows:
@inproceedings{lyu2022mildly,
  title={Mildly Conservative Q-learning for Offline Reinforcement Learning},
  author={Jiafei Lyu and Xiaoteng Ma and Xiu Li and Zongqing Lu},
  booktitle={Thirty-sixth Conference on Neural Information Processing Systems},
  year={2022}
}