fix stage2 main_grad acc bug #59142
Conversation
Your PR has been submitted successfully. Thank you for contributing to this open-source project!
If machine resources are available, please also run convergence tests and put the loss curves in the PR description. We probably need the following 4 comparison groups:
|      | main_grad False | main_grad True |
| ---- | --------------- | -------------- |
| bf16 |                 |                |
| fp16 |                 |                |
some minor suggestions
Branch updated from 8e2aa1e to 194aabc.
LGTM
LGTM
merge develop to bug fix branch
    not hasattr(param, "main_grad")
    and grad.dtype == Type.fp16.value
):
    if grad is None:
Why is this `if` still here 😂? It could just be folded into the assert below.
Will fix it in the next PR.
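For reference, a hedged sketch of what that consolidation might look like (illustrative names and a stand-in `Type` enum; the surrounding stage2 logic is simplified and this is not the code in this PR):

```python
from enum import Enum

import paddle


class Type(Enum):
    # stand-in for the Type enum used by the sharding utilities
    fp16 = paddle.float16


def _check_accumulated_grad(param, grad):
    # Hypothetical consolidation: once the outer condition already requires
    # `grad is not None`, the inner `if grad is None:` branch is redundant and
    # only the assert needs to remain.
    if (
        grad is not None
        and not hasattr(param, "main_grad")
        and grad.dtype == Type.fp16.value
    ):
        assert grad._is_initialized(), (
            "the accumulated fp16 grad must be initialized before it is used"
        )
```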
LGTM
* fix stage2 main_grad acc bug
* update code according to suggest
* scale in opt
* merge grad scale
* add note
* delete debug info
* keep offload unchange
* Optimize the BF16 unittest of sharding stage2 and stage3.
* fix stage3 bug
* add fp16 judge
* add init
* add fp16
* fix grad clip
* add if data type is fp16
* change if location
* delete fault arg
* add enmu.value

---------

Co-authored-by: Liu Yiqun <liuyiqun01@baidu.com>
Co-authored-by: tianhaodongbd <tianhaodong@baidu.com>
PR types
Bug fixes
PR changes
Others
Description
PCard-70444
1. Fix the gradient-accumulation bug in stage2 when main_grad is enabled: the computation order was wrong. The original order was (grad1/n + grad2)/n; after this change, main_grad accumulation is moved into the optimizer (opt), and the order becomes (grad1 + grad2)/n (see the numeric sketch after this list).
2. When gradient accumulation is enabled for fused_param_linear_grad_add, neither grad is None; the difference is that the first grad has been initialized while the second has not. Using fp16-O2 fused_param_linear_grad_add in stage2 therefore produces wrong results, so this PR raises an error for that case (assert grad._is_initialized()).
3. To make sure grad.dtype can always be accessed, grad is not None must be checked first.
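To illustrate point 1, a minimal numeric sketch of the two accumulation orders (plain Python, not the actual stage2/optimizer code):

```python
# With n accumulation steps, the old path rescaled on every accumulation, so earlier
# grads were divided by n more than once; the fix accumulates first and scales once
# in the optimizer.
n = 2
grad1, grad2 = 0.4, 0.6

old_order = (grad1 / n + grad2) / n   # 0.4: grad1 is effectively divided by n twice
new_order = (grad1 + grad2) / n       # 0.5: every grad is divided by n exactly once

print(old_order, new_order)
```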
The convergence test plan is as follows: three groups of comparison experiments were run, as shown in the figures below.
O1 and O2 mixed-precision training go through different code branches.
Convergence results:
2. bf16, main_grad off: stage2 vs. stage1 convergence results
3. fp16, main_grad off: stage2 vs. stage1 convergence results
After updating the code, the 10k-step convergence results are as follows:
2. bf16, main_grad off: stage2 vs. stage1 convergence results
3. fp16, main_grad off: stage2 vs. stage1 convergence results
4. fp16, main_grad on: stage2 vs. stage1 convergence results