Logs are stored in the directory given by --log-dir; the training curves can be viewed with tensorboard.
To change the number of sampled actions, pass the --num flag (default: 10). To normalize the offline data, pass the --normalize flag (this is optional).
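For illustration, below is a minimal argparse sketch of these flags. The script structure and the default value of --log-dir are assumptions made for the sketch and may differ from the actual code; only --num's default of 10 and the optional --normalize flag come from the description above.

```python
# Illustrative sketch only: how the flags described above might be declared.
# The default for --log-dir is an assumption; --num's default of 10 and the
# optional --normalize switch follow the description above.
import argparse

parser = argparse.ArgumentParser(description="MCQ training (sketch)")
parser.add_argument("--log-dir", type=str, default="./logs",
                    help="directory where tensorboard logs are written (default assumed)")
parser.add_argument("--num", type=int, default=10,
                    help="number of sampled actions")
parser.add_argument("--normalize", action="store_true",
                    help="normalize the offline data (optional)")
args = parser.parse_args()
```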
Instructions
In the paper and our implementation, we update the critics via:
$\mathcal{L}_{critic} = \lambda \mathbb{E}_{s,a,s^\prime\sim\mathcal{D},a^\prime\sim\pi(\cdot|s^\prime)}[(Q(s,a) - y)^2] + (1-\lambda)\mathbb{E}_{s\sim\mathcal{D},a\sim\pi(\cdot|s)}[(Q(s,a) - y^\prime)^2]$. One can instead update the critic via $\mathcal{L}_{critic} = \mathbb{E}_{s,a,s^\prime\sim\mathcal{D},a^\prime\sim\pi(\cdot|s^\prime)}[(Q(s,a) - y)^2] + \alpha\mathbb{E}_{s\sim\mathcal{D},a\sim\pi(\cdot|s)}[(Q(s,a) - y^\prime)^2]$. This is also reasonable because $\lambda$ should never be set to $0$; the two forms are related by $\alpha = \frac{1-\lambda}{\lambda}$ with $\lambda\in(0,1)$. Note that the hyperparameter scale changes considerably under the $\alpha$-form (e.g., $\lambda = 0.1$ corresponds to $\alpha = 9$, while $\lambda = 0.5$ corresponds to $\alpha = 1$).
Readers are welcome to try the $\alpha$-style update rule.
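For concreteness, here is a small PyTorch-style sketch of the two weighting schemes, assuming the Q-value predictions and the targets $y$, $y^\prime$ are computed elsewhere; it only illustrates the loss weighting, not the full implementation in this repository.

```python
# Illustrative sketch of the two critic-loss weightings discussed above.
# q_data : Q(s, a) on dataset actions,        y       : its target
# q_pi   : Q(s, a_pi) with a_pi ~ pi(.|s),    y_prime : its pseudo target
import torch.nn.functional as F

def critic_loss_lambda(q_data, y, q_pi, y_prime, lam):
    """lam * E[(Q(s,a) - y)^2] + (1 - lam) * E[(Q(s,a_pi) - y')^2], lam in (0, 1)."""
    return lam * F.mse_loss(q_data, y) + (1.0 - lam) * F.mse_loss(q_pi, y_prime)

def critic_loss_alpha(q_data, y, q_pi, y_prime, alpha):
    """Equivalent alpha-style rule with alpha = (1 - lam) / lam."""
    return F.mse_loss(q_data, y) + alpha * F.mse_loss(q_pi, y_prime)
```

Dividing the $\lambda$-weighted loss by $\lambda$ yields exactly the $\alpha$-weighted loss, so the two share the same minimizer but differ in overall gradient scale, which is why the hyperparameter values look so different between the two forms.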
Citation
If you use our method or code in your research, please consider citing the paper as follows:
@inproceedings{lyu2022mildly,
  title={Mildly Conservative Q-learning for Offline Reinforcement Learning},
  author={Jiafei Lyu and Xiaoteng Ma and Xiu Li and Zongqing Lu},
  booktitle={Thirty-sixth Conference on Neural Information Processing Systems},
  year={2022}
}