[auto parallel] Support 1F1B for PIR #66810
Conversation
… fit_grad_merge_for_pir
… fit_grad_merge_for_pir
Your PR has been submitted successfully. Thank you for contributing to the open-source project!
… fit_1f1b_for_pir
), "PIR does not support 1F1B with enable_send_recv_overlap yet." | ||
|
||
types = [FORWARD, BACKWARD, OPT] | ||
sub_program_list = _pir_program_for_fthenb_and_1f1b( |
This graph-splitting functionality should be generalized; the gradmerge side should be able to use it too.
This graph-splitting functionality is already general-purpose; the function name may be causing the confusion. Its main job is to split the program into three sub-programs: forward, backward, and opt.
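For readers of this thread, a minimal sketch of that idea, assuming each op carries an `op_role` attribute; the helper name and role encodings are hypothetical, not the actual `_pir_program_for_fthenb_and_1f1b` implementation:

```python
# Hypothetical sketch: bucket a PIR program's ops into forward /
# backward / optimizer groups by their op_role attribute. The real
# pass clones these buckets into three separate sub-programs.
FORWARD, BACKWARD, OPT = 0, 1, 2  # assumed OpRole encodings

def group_ops_by_role(program):
    groups = {FORWARD: [], BACKWARD: [], OPT: []}
    for op in program.global_block().ops:
        if op.op_role in groups:
            groups[op.op_role].append(op)
    return groups
```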
print("loss_fthenb", loss_fthenb) | ||
print("loss_1f1b", loss_1f1b) | ||
self.assertTrue( | ||
np.allclose( |
This check is too loose. If this is only a comparison between 1F1B and FThenB, it should be possible to assert exact equality (all_equal).
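A sketch of the stricter assertion (`np.array_equal` is standard NumPy; the loss arrays here are stand-ins for the values collected by the test):

```python
import numpy as np

# Stand-ins for the losses gathered by the two schedules.
loss_fthenb = np.array([2.3025851], dtype=np.float32)
loss_1f1b = np.array([2.3025851], dtype=np.float32)

# 1F1B and FThenB run the same per-micro-batch computation, only
# interleaved differently, so with deterministic kernels the losses
# should be bit-identical and a tolerance-free check is possible:
assert np.array_equal(loss_fthenb, loss_1f1b)
```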
pipeline = strategy.pipeline
pipeline.enable = True
pipeline.schedule_mode = "1F1B"
pipeline.accumulate_steps = 2
The pp2-acc2 parallelism here is a bit low. I suggest following up with a unit test at higher parallelism, and additionally testing compatibility with existing feature strategies, e.g. tp2-pp4-acc8-amp-o2.
OK, will do.
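A hedged sketch of what such a configuration might look like, modeled on the pipeline fields used in this test; the amp field names are assumptions to check against the actual Strategy API, and the tp/pp degrees are set by the process mesh rather than here:

```python
# Hypothetical tp2-pp4-acc8-amp-o2 strategy, mirroring the test above.
pipeline = strategy.pipeline
pipeline.enable = True
pipeline.schedule_mode = "1F1B"
pipeline.accumulate_steps = 8  # acc8

amp = strategy.amp       # assumed field on the same strategy object
amp.enable = True
amp.level = "o2"         # assumed spelling for AMP-O2
```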
print("loss_1f1b", loss_1f1b) | ||
self.assertTrue( | ||
np.allclose( | ||
loss_fthenb, loss_1f1b, rtol=self.rtol, atol=self.atol |
Also, the end-to-end precision-alignment check is too coarse; it would be better to add functional checks of the core logic, e.g. verifying the scheduling: with pp=4 and acc=8, check whether the 1F1B and FThenB job orders on each rank match expectations.
Related unit tests have been added.
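As an illustration of what such a scheduling check can assert, a self-contained sketch that derives the expected per-rank 1F1B job order from the standard warmup/steady/cooldown structure (the function and job labels are hypothetical, not the pass's actual names):

```python
def expected_1f1b_schedule(rank, pp_degree, acc_steps):
    # Each rank runs (pp_degree - rank - 1) warmup forwards, then
    # alternates one-forward-one-backward in the steady state, then
    # drains the remaining backwards, and finally the optimizer job.
    warmup = pp_degree - rank - 1
    jobs = [f"forward{i}" for i in range(warmup)]
    f = warmup
    for b in range(acc_steps):
        if f < acc_steps:
            jobs.append(f"forward{f}")
            f += 1
        jobs.append(f"backward{b}")
    jobs.append("optimizer")
    return jobs

# pp=4, acc=8: the last stage (rank 3) has no warmup and starts 1F1B
# immediately, while rank 0 runs 3 warmup forwards first.
assert expected_1f1b_schedule(3, 4, 8)[:4] == [
    "forward0", "backward0", "forward1", "backward1"
]
assert expected_1f1b_schedule(0, 4, 8)[:3] == [
    "forward0", "forward1", "forward2"
]
```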
LGTM
… fit_1f1b_for_pir
… fit_1f1b_for_pir
…into fit_1f1b_for_pir
if mode == "train" and self._strategy.pipeline.enable: | ||
self._strategy.gradient_merge.enable = True | ||
self._strategy.gradient_merge.k_steps = ( | ||
self._strategy.pipeline.accumulate_steps | ||
) | ||
self._strategy.gradient_merge.avg = True |
[New add] This fills in the grad_merge-related strategy configuration for the case where pipeline is enabled.
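A runnable sketch of the effect on a stand-in strategy object (`SimpleNamespace` replaces the real strategy classes purely for illustration):

```python
from types import SimpleNamespace

# Stand-in mirroring the fields used in the engine snippet above.
strategy = SimpleNamespace(
    pipeline=SimpleNamespace(enable=True, accumulate_steps=4),
    gradient_merge=SimpleNamespace(enable=False, k_steps=1, avg=False),
)

# The same propagation logic as above, applied to the stand-in:
if strategy.pipeline.enable:
    strategy.gradient_merge.enable = True
    strategy.gradient_merge.k_steps = strategy.pipeline.accumulate_steps
    strategy.gradient_merge.avg = True

assert strategy.gradient_merge.k_steps == 4  # follows accumulate_steps
```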
@@ -285,7 +289,7 @@ def _pir_append_gradient_merge_backward_op(
     with startup_block:
         paddle.pir.set_insertion_point_to_block_end(startup_block)
         gradient_merge_var = paddle.full(
-            shape=grad.shape, fill_value=float(0), dtype=grad.dtype
+            shape=grad._local_shape, fill_value=float(0), dtype=grad.dtype
[New add] This should be grad._local_shape; using grad.shape directly raises an error when grad is sharded. This case was caught by modifying semi_auto_llama_acc_align.py to run with grad_merge enabled.
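A pure-Python sketch of the global-vs-local shape distinction; the helper is hypothetical, standing in for what `_local_shape` reports on a sharded value:

```python
# Hypothetical helper: a tensor sharded along `shard_dim` across
# `degree` ranks stores only 1/degree of that dimension per rank.
def local_shape(global_shape, shard_dim, degree):
    shape = list(global_shape)
    shape[shard_dim] //= degree
    return shape

# A grad with global shape [8, 4], sharded on dim 0 across 2 ranks:
# the per-rank gradient-merge buffer must be allocated with the local
# shape [4, 4]; passing the global shape would mismatch the shard.
assert local_shape([8, 4], 0, 2) == [4, 4]
```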
opt_ops_use_grad = [
    op
    for op in grad.all_used_ops()
    if op.op_role == int(OpRole.Optimize)
]
grad.replace_grad_users_with(
    new_gradient_merge_var, set(opt_ops_use_grad)
)
[New add] The previous version of this code was wrong; only the inputs of the optimizer ops should be modified.
"pd_op.c_reduce_sum", | ||
"pd_op.c_reduce_avg", | ||
]: | ||
if op.name() in comm_ops: |
[New add] Replaced the hard-coded list with the comm_ops maintained in pir_utils.
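A sketch of the centralization; only the two op names visible in this diff are certain, the rest of the list and the helper are illustrative:

```python
# Hypothetical excerpt of the shared list kept in pir_utils, so every
# pass checks the same set instead of hard-coding its own:
comm_ops = [
    "pd_op.c_reduce_sum",
    "pd_op.c_reduce_avg",
    # ...remaining collective ops tracked centrally
]

def is_comm_op(op) -> bool:
    # Mirrors the `if op.name() in comm_ops:` check in the pass.
    return op.name() in comm_ops
```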
new_grad.get_defining_op().op_role = int(OpRole.Optimize)
scale.get_defining_op().op_role = int(OpRole.Optimize)
[New add] Corrected the op roles of the scale and full ops.
PR Category
auto parallel
PR Types
Not User Facing
Description
Pcard-76459
Adapt the 1F1B pass for PIR.