【Auto-Parallel | Comm】fix communication hang issue on GPU-H(VPP) #71104

zty-king · 2025-02-12T11:51:06Z

PR Category

Auto Parallel

PR Types

Bug fixes

Description

Fix a communication hang on GPU-H (VPP).

Split forward into recv_forward and forward; split backward into send_backward and backward.

paddle-bot · 2025-02-12T11:51:15Z

Your PR has been submitted successfully. Thanks for your contribution to the open-source project!
Please watch for the automated CI test results first. See the Paddle CI Manual for details.

paddle-ci-bot · 2025-02-22T03:01:57Z

Sorry to inform you that 665ae80's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

From00 · 2025-03-21T02:52:14Z

python/paddle/distributed/passes/pipeline_scheduler_pass/pipeline_vpp.py

@@ -59,6 +67,8 @@ def _record_bwd_micro_step(self, virtual_pp_rank):
        return real_micro_step

    def _create_job_list(self):
+        if self._in_pir_mode:
+            return self._pir_create_job_list()


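After adding this check, under what conditions will the original code still be reached?

When FLAGS_enable_pir_api=1 is set, PIR mode is used and self._pir_create_job_list() is called; otherwise the original _create_job_list body runs. _create_job_list was kept because, at the time of the earlier submission, some unit tests had not yet been switched over and still used the old IR mode, so CI would not pass without it. Once all unit tests have been switched, _create_job_list can be removed.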
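The dispatch described in this reply can be sketched as below. This is a minimal illustration, not Paddle's implementation: `VppSchedulerSketch`, `_legacy_create_job_list`, and the placeholder return values are hypothetical, and the flag is passed in directly instead of being read from `FLAGS_enable_pir_api`.

```python
class VppSchedulerSketch:
    """Hypothetical stand-in for the VPP scheduler pass (not Paddle's class)."""

    def __init__(self, in_pir_mode):
        # Assumption: in the real pass this mirrors FLAGS_enable_pir_api.
        self._in_pir_mode = in_pir_mode

    def _pir_create_job_list(self):
        # Placeholder for the PIR-mode job list.
        return ["pir-job-list"]

    def _legacy_create_job_list(self):
        # Placeholder for the old-IR body that the PR keeps for unit tests.
        return ["legacy-job-list"]

    def _create_job_list(self):
        # Early return: in PIR mode the legacy body is never reached.
        if self._in_pir_mode:
            return self._pir_create_job_list()
        return self._legacy_create_job_list()
```

With this shape, the legacy path is only reachable when PIR mode is off, which is the condition the reviewer asked about.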

From00 · 2025-03-21T02:56:21Z

python/paddle/distributed/passes/pass_utils.py

@@ -1113,33 +1116,99 @@ def add_persistable_var(op_idx, program_type):
        for type in following_program_types:
            type_to_ops[type][op_idx].erase()

+    def _add_dependency(recorder_op, waiter_op, name):


An identical copy of this function already exists in pipeline_1f1b; it should be reused rather than copied.
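One way to read this suggestion is a single module-level helper that both passes import. The sketch below is hypothetical: `FakeOp` is a stand-in that only stores attributes, and the helper contains only the two attribute calls quoted in this review's diff hunks (`set_str_attr` and `set_str_array_attr`); the real function may do more.

```python
class FakeOp:
    """Hypothetical stand-in for an operation; it only records attributes."""

    def __init__(self):
        self.attrs = {}

    def set_str_attr(self, key, value):
        self.attrs[key] = value

    def set_str_array_attr(self, key, values):
        self.attrs[key] = list(values)


def add_dependency(recorder_op, waiter_op, name):
    """Shared helper: recorder_op records event `name`, waiter_op waits on it."""
    recorder_op.set_str_attr("event_to_record", name)
    waiter_op.set_str_array_attr("events_to_wait", [name])


# Both the 1F1B and VPP code paths would call the same function
# instead of each defining a private copy.
send_op, recv_op = FakeOp(), FakeOp()
add_dependency(send_op, recv_op, "event_0")
```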

From00 · 2025-03-21T02:57:04Z

python/paddle/distributed/passes/pass_utils.py

+        recorder_op.set_str_attr("event_to_record", name)
+        waiter_op.set_str_array_attr("events_to_wait", [name])
+
+    def _add_dependency_if_necessary(


From00 · 2025-03-21T03:00:52Z

python/paddle/distributed/passes/pass_utils.py

+        chunk_ids = list(range(num_model_chunks))
+        # Forward process: the recv and forward of each chunk are put together
+        for chunk_id in chunk_ids:
+            type_to_program[f"recv_forward{chunk_id}"] = program.clone()


Many places in these lines share similar logic; could they be abstracted into a shared function?
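A rough sketch of the kind of abstraction being requested follows. Everything here is illustrative: `FakeProgram`, `clone_for_chunks`, and `split_for_vpp_sketch` are hypothetical names, and the job-type prefixes and their order are assumptions based on the names used in the PR description, not the merged code.

```python
class FakeProgram:
    """Hypothetical stand-in for a program object that supports clone()."""

    def __init__(self, ops):
        self.ops = list(ops)

    def clone(self):
        return FakeProgram(self.ops)


def clone_for_chunks(program, job_prefix, num_model_chunks):
    """Clone `program` once per chunk, keyed as f"{job_prefix}{chunk_id}"."""
    return {
        f"{job_prefix}{chunk_id}": program.clone()
        for chunk_id in range(num_model_chunks)
    }


def split_for_vpp_sketch(program, num_model_chunks):
    """Build the per-chunk program table with one loop instead of repeated blocks."""
    type_to_program = {}
    for prefix in ("recv_forward", "forward", "backward", "send_backward"):
        type_to_program.update(
            clone_for_chunks(program, prefix, num_model_chunks)
        )
    return type_to_program
```

Factoring the per-chunk loop into one helper keeps each job type to a single line at the call site, so adding or renaming a job type touches one tuple rather than several near-identical loops.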


From00

LGTM

liym27 · 2025-03-25T04:33:44Z

python/paddle/distributed/passes/pass_utils.py

+
+    return program_name, cloned_program, ops
+
+
 def _split_program_for_vpp(
    program, num_model_chunks, oprole_names, split_bw=False


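On module decoupling, considering code maintenance, readability, and SRP:

Would it be better to move both _split_program_for_vpp and its caller _pir_program_for_vpp into pipeline_vpp.py?
pass_utils.py is currently bloated (nearly 70 functions), and these two functions are used only by VPP; nothing else calls them.

_add_dependency_if_necessary is used only in the PP scenario, so I suggest moving it to a related directory or file, such as pipeline_scheduler_pass/pipeline_pass_base.py.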

liym27 · 2025-03-25T04:40:55Z

python/paddle/distributed/passes/pipeline_scheduler_pass/pipeline_vpp.py

@@ -32,6 +32,8 @@
 )
 from .pipeline_pass_base import PipelinePassBase

+RECV_FORWARD = "recv_forward"
+SEND_BACKWARD = "send_backward"
 FORWARD = "forward"
 BACKWARD = "backward"
 OPT = "optimizer"


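Several constants, such as RECV_FORWARD and OPT, are already defined in pipeline_1f1b.py; I suggest unifying them into a single definition.

All PP-strategy-related constants in the pipeline_scheduler_pass module have been moved to members of the PipelinePassBase class and are now accessed uniformly through self.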
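The arrangement described in this reply can be sketched as follows. The constant names and values are taken from the diff hunk above; the class names ending in `Sketch` and the `chunk_job_types` method are hypothetical and exist only to show access through `self`.

```python
class PipelinePassBaseSketch:
    """Hypothetical base class holding the job-type constants in one place."""

    RECV_FORWARD = "recv_forward"
    SEND_BACKWARD = "send_backward"
    FORWARD = "forward"
    BACKWARD = "backward"
    OPT = "optimizer"


class VppPassSketch(PipelinePassBaseSketch):
    def chunk_job_types(self, chunk_id):
        # Constants are read through self, so subclasses share one definition.
        return [
            f"{self.RECV_FORWARD}{chunk_id}",
            f"{self.FORWARD}{chunk_id}",
            f"{self.BACKWARD}{chunk_id}",
            f"{self.SEND_BACKWARD}{chunk_id}",
        ]


class OneFOneBPassSketch(PipelinePassBaseSketch):
    # Inherits the same constants; no module-level duplicates are needed.
    pass
```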

liym27

LGTM

…dlePaddle#71104) * fix communication hang issue on GPU-H(VPP) * fix communication hang issue on GPU-H(VPP) * fix communication hang issue on GPU-H(VPP) * Build shared functions for reuse * Optimize code structure * Optimize code structure * Consolidate the 1F1B program-splitting code into Pipeline1F1BPass * Fix the 1F1B code * Fix the 1F1B code

paddle-bot bot added the contributor External developers label Feb 12, 2025

fix communication hang issue on GPU-H(VPP)

da290b4

zty-king force-pushed the pipeline_vpp branch from 665ae80 to da290b4 Compare March 18, 2025 17:01

zty-king added 2 commits March 18, 2025 17:07

fix communication hang issue on GPU-H(VPP)

0c224a0

fix communication hang issue on GPU-H(VPP)

02e4adf

From00 reviewed Mar 21, 2025

zty-king added 2 commits March 22, 2025 16:02

Build shared functions for reuse

bc29637

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

cc0942f

… pipeline_vpp

From00 previously approved these changes Mar 25, 2025

liym27 reviewed Mar 25, 2025

Optimize code structure

f395aee

zty-king dismissed From00’s stale review via f395aee March 26, 2025 04:15

zty-king added 4 commits March 26, 2025 05:25

Optimize code structure

bba24f3

Consolidate the 1F1B program-splitting code into Pipeline1F1BPass

efab6ed

Fix the 1F1B code

4d8735d

Fix the 1F1B code

cc19809

liym27 approved these changes Mar 28, 2025

From00 merged commit f1707f7 into PaddlePaddle:develop Mar 28, 2025
32 checks passed
