fix stage2 main_grad acc bug #59142
Conversation
Your PR has been submitted successfully. Thank you for contributing to this open-source project!
If machine resources are available, please also run convergence tests and put the loss curves in the PR description. We probably need the following 4 comparison groups:
|      | main_grad False | main_grad True |
| ---- | --------------- | -------------- |
| bf16 |                 |                |
| fp16 |                 |                |
some minor suggestions
Branch updated from 8e2aa1e to 194aabc.
LGTM
LGTM
merge develop to bug fix branch
    not hasattr(param, "main_grad")
    and grad.dtype == Type.fp16.value
):
    if grad is None:
Why is this `if` still here 😂? It could just be folded into the assert below.
Will fix it in the next PR.
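For reference, a hedged sketch of what that consolidation might look like (illustrative names and a stand-in `Type` enum; the surrounding stage2 logic is simplified and this is not the code in this PR):

```python
from enum import Enum

import paddle


class Type(Enum):
    # stand-in for the Type enum used by the sharding utilities
    fp16 = paddle.float16


def _check_accumulated_grad(param, grad):
    # Hypothetical consolidation: once the outer condition already requires
    # `grad is not None`, the inner `if grad is None:` branch is redundant and
    # only the assert needs to remain.
    if (
        grad is not None
        and not hasattr(param, "main_grad")
        and grad.dtype == Type.fp16.value
    ):
        assert grad._is_initialized(), (
            "the accumulated fp16 grad must be initialized before it is used"
        )
```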
LGTM
* fix stage2 main_grad acc bug
* update code according to suggest
* scale in opt
* merge grad scale
* add note
* delete debug info
* keep offload unchange
* Optimize the BF16 unittest of sharding stage2 and stage3.
* fix stage3 bug
* add fp16 judge
* add init
* add fp16
* fix grad clip
* add if data type is fp16
* change if location
* delete fault arg
* add enmu.value

---------

Co-authored-by: Liu Yiqun <liuyiqun01@baidu.com>
Co-authored-by: tianhaodongbd <tianhaodong@baidu.com>
PR types
Bug fixes
PR changes
Others
Description
PCard-70444
1. Fix the gradient-accumulation bug in stage2 when main_grad is enabled: the computation order was wrong. The original order was (grad1/n + grad2)/n; after this change, main_grad accumulation is moved into the optimizer (opt), and the order becomes (grad1 + grad2)/n (see the numeric sketch after this list).
2. When gradient accumulation is enabled for fused_param_linear_grad_add, neither grad is None; the difference is that the first grad has been initialized while the second has not. Using fp16-O2 fused_param_linear_grad_add in stage2 therefore produces wrong results, so this PR raises an error for that case (assert grad._is_initialized()).
3. To make sure grad.dtype can always be accessed, grad is not None must be checked first.
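To illustrate point 1, a minimal numeric sketch of the two accumulation orders (plain Python, not the actual stage2/optimizer code):

```python
# With n accumulation steps, the old path rescaled on every accumulation, so earlier
# grads were divided by n more than once; the fix accumulates first and scales once
# in the optimizer.
n = 2
grad1, grad2 = 0.4, 0.6

old_order = (grad1 / n + grad2) / n   # 0.4: grad1 is effectively divided by n twice
new_order = (grad1 + grad2) / n       # 0.5: every grad is divided by n exactly once

print(old_order, new_order)
```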
The convergence test plan is as follows: three groups of comparison experiments were run, as shown in the figures below.
O1 and O2 mixed-precision training go through different code branches.
Convergence results:
2. bf16, main_grad off: stage2 vs. stage1 convergence results
3. fp16, main_grad off: stage2 vs. stage1 convergence results
After updating the code, the 10k-step convergence results are as follows:
2. bf16, main_grad off: stage2 vs. stage1 convergence results
3. fp16, main_grad off: stage2 vs. stage1 convergence results
4. fp16, main_grad on: stage2 vs. stage1 convergence results