AdaMod is an optimizer that exerts adaptive, momental upper bounds on individual learning rates to prevent them from becoming undesirably larger than what the historical statistics suggest, thereby avoiding the non-convergence issue and leading to better performance. Strong empirical results on many deep learning applications demonstrate the effectiveness of the method, especially on complex networks such as DenseNet and Transformer.
As described in the paper, AdaMod smooths out unexpectedly large learning rates throughout the training process. The beta3 parameter is the smoothing coefficient for the actual learning rates and controls the averaging range. In common cases, a beta3 in {0.999, 0.9999} achieves relatively good and stable results. See the paper for more details.
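To make the role of beta3 concrete, here is a minimal sketch of one AdaMod-style update for a single tensor. It is a hypothetical illustration written for this README, not the repository's actual optimizer code or API: the function names (`init_state`, `adamod_step`) and the state layout are assumptions made for the example.

```python
import torch

def init_state(param):
    # Hypothetical per-parameter state for the sketch below.
    return {
        "step": 0,
        "exp_avg": torch.zeros_like(param),      # first moment (Adam)
        "exp_avg_sq": torch.zeros_like(param),   # second moment (Adam)
        "exp_avg_lr": torch.zeros_like(param),   # smoothed per-element step sizes (AdaMod)
    }

def adamod_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999), beta3=0.999, eps=1e-8):
    # Adam-style moment estimates with bias correction.
    state["step"] += 1
    state["exp_avg"].mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    state["exp_avg_sq"].mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    bias1 = 1 - betas[0] ** state["step"]
    bias2 = 1 - betas[1] ** state["step"]
    denom = (state["exp_avg_sq"] / bias2).sqrt().add_(eps)

    # Per-element step sizes, as Adam would apply them.
    step_size = torch.full_like(param, lr / bias1) / denom

    # Momental bound: keep an exponential moving average of the step sizes
    # (coefficient beta3) and clip each step size by that average, so no
    # element's learning rate can exceed what its history suggests.
    state["exp_avg_lr"].mul_(beta3).add_(step_size, alpha=1 - beta3)
    step_size = torch.min(step_size, state["exp_avg_lr"])

    param.data.add_(-step_size * state["exp_avg"])
```

A larger beta3 averages over a longer history, so the bound reacts more slowly and suppresses sudden spikes in the learning rates more aggressively.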
@inproceedings{DBLP:conf/nlpcc/DingRL23,
author = {Jianbang Ding and Xuancheng Ren and Ruixuan Luo},
title = {An Adaptive Learning Method for Solving the Extreme Learning Rate Problem of Transformer},
booktitle = {{NLPCC} {(1)}},
series = {Lecture Notes in Computer Science},
volume = {14302},
pages = {361--372},
publisher = {Springer},
year = {2023}
}
The arXiv version is available as an alternative:
@article{ding2019adaptive,
title={An Adaptive and Momental Bound Method for Stochastic Learning},
author={Jianbang Ding and Xuancheng Ren and Ruixuan Luo and Xu Sun},
journal={arXiv preprint arXiv:1910.12249},
year={2019}
}
Demo
For the full list of demos, please refer to this page.