【Auto-Parallel | Comm】fix communication hang issue on GPU-H(VPP) #71104
Your PR was submitted successfully. Thank you for contributing to this open-source project!
Sorry to inform you that 665ae80's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
@@ -59,6 +67,8 @@ def _record_bwd_micro_step(self, virtual_pp_rank):
        return real_micro_step

    def _create_job_list(self):
        if self._in_pir_mode:
            return self._pir_create_job_list()
With this check added, under what conditions will the original code still run?
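Setting FLAGS_enable_pir_api=1 enables PIR mode, in which case self._pir_create_job_list() is used; otherwise the original _create_job_list logic is used. _create_job_list was kept because, at the time of the earlier submission, some unit tests had not yet been switched over and still ran in the old IR mode, so CI would not pass without it. Once all unit tests have been switched, _create_job_list can be removed.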
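The dispatch described in this reply can be sketched as follows. This is a minimal illustration, not the PR's code: VPPSchedulerSketch and pir_mode_from_env are hypothetical names, the job lists are placeholders, and reading the flag from an environment variable is an assumption about how the setting reaches the pass.

```python
import os


def pir_mode_from_env(environ=os.environ):
    # Assumption: the flag is visible as an environment variable set to "1".
    return environ.get("FLAGS_enable_pir_api", "0") == "1"


class VPPSchedulerSketch:
    """Hypothetical stand-in for the scheduler pass; not the real Paddle class."""

    def __init__(self, in_pir_mode):
        self._in_pir_mode = in_pir_mode

    def _pir_create_job_list(self):
        # Placeholder for the PIR-mode job list.
        return ["pir"]

    def _create_job_list(self):
        # New early return: PIR mode takes the new path.
        if self._in_pir_mode:
            return self._pir_create_job_list()
        # Legacy path, kept only while some unit tests still use the old IR.
        return ["legacy"]
```

Once every unit test runs in PIR mode, the legacy branch becomes unreachable and can be deleted, which matches the plan stated in the reply.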
@@ -1113,33 +1116,99 @@ def add_persistable_var(op_idx, program_type):
        for type in following_program_types:
            type_to_ops[type][op_idx].erase()


def _add_dependency(recorder_op, waiter_op, name):
An identical copy of this function already exists in pipeline_1f1b. It should be reused rather than copied.
Done
    recorder_op.set_str_attr("event_to_record", name)
    waiter_op.set_str_array_attr("events_to_wait", [name])


def _add_dependency_if_necessary(
Same as above.
Done
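The reuse requested in these two threads can be sketched with a single shared helper. This is an illustration only: FakeOp is a hypothetical stand-in that implements just the two setters visible in the diff, and the event name used in the usage lines is made up. The helper body mirrors the two lines shown in the hunk above.

```python
class FakeOp:
    """Hypothetical stand-in for an operation object; not a Paddle class."""

    def __init__(self):
        self.attrs = {}

    def set_str_attr(self, key, value):
        self.attrs[key] = value

    def set_str_array_attr(self, key, values):
        self.attrs[key] = list(values)


def add_dependency(recorder_op, waiter_op, name):
    # The recorder publishes an event; the waiter blocks until it is recorded.
    recorder_op.set_str_attr("event_to_record", name)
    waiter_op.set_str_array_attr("events_to_wait", [name])


# Usage: both the 1F1B and VPP passes would import this one helper
# instead of each keeping a private copy.
send_op, recv_op = FakeOp(), FakeOp()
add_dependency(send_op, recv_op, "event_0")  # "event_0" is an invented name
```

Keeping one definition means a later fix to the event attributes applies to every pipeline schedule at once.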
    chunk_ids = list(range(num_model_chunks))
    # Forward process: the recv and forward of each chunk are put together
    for chunk_id in chunk_ids:
        type_to_program[f"recv_forward{chunk_id}"] = program.clone()
Much of the logic in these lines is similar. Could it be abstracted into a shared function?
Done
LGTM
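One way the repeated per-chunk cloning could be factored into a shared function is sketched below. This is an assumption-laden illustration rather than the PR's implementation: FakeProgram, clone_chunk_program, and build_type_to_program are hypothetical names, and only the clone() call and the (program_name, cloned_program, ops) return shape are taken from the diff.

```python
import copy


class FakeProgram:
    """Hypothetical stand-in for a program object that supports clone()."""

    def __init__(self, ops):
        self.ops = list(ops)

    def clone(self):
        return FakeProgram(copy.deepcopy(self.ops))


def clone_chunk_program(program, job_type, chunk_id):
    # One helper replaces the repeated "name, clone, collect ops" lines.
    program_name = f"{job_type}{chunk_id}"
    cloned_program = program.clone()
    return program_name, cloned_program, cloned_program.ops


def build_type_to_program(program, job_types, num_model_chunks):
    type_to_program, type_to_ops = {}, {}
    for chunk_id in range(num_model_chunks):
        for job_type in job_types:
            name, cloned, ops = clone_chunk_program(program, job_type, chunk_id)
            type_to_program[name] = cloned
            type_to_ops[name] = ops
    return type_to_program, type_to_ops
```

Each job type and chunk then goes through the same path, so adding a new job type only means extending the list passed as job_types.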
    return program_name, cloned_program, ops


def _split_program_for_vpp(
    program, num_model_chunks, oprole_names, split_bw=False
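On module decoupling, considering code maintenance, readability, and the single-responsibility principle (SRP):
- Would it be better to move both _split_program_for_vpp and its caller _pir_program_for_vpp into pipeline_vpp.py? pass_utils.py is already bloated (nearly 70 functions), and these two functions are used only by VPP; nothing else calls them.
- _add_dependency_if_necessary is used only in the PP scenario. I suggest moving it to a related directory or file, such as pipeline_scheduler_pass/pipeline_pass_base.py.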
Done
@@ -32,6 +32,8 @@
)
from .pipeline_pass_base import PipelinePassBase

RECV_FORWARD = "recv_forward"
SEND_BACKWARD = "send_backward"
FORWARD = "forward"
BACKWARD = "backward"
OPT = "optimizer"
Several constants, including RECV_FORWARD and OPT, are already defined in pipeline_1f1b.py. I suggest consolidating them into a single definition.
All constants related to PP strategies in the pipeline_scheduler_pass module have been moved into the PipelinePassBase class as members and are now accessed uniformly through self.
LGTM
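The consolidation described in this thread can be sketched as class-level constants on a shared base. This is a minimal illustration: the class names and the job_names method are hypothetical, while the constant names and values are the ones shown in the diff above.

```python
class PipelinePassBaseSketch:
    """Hypothetical stand-in for the shared base class; holds the one definition."""

    RECV_FORWARD = "recv_forward"
    SEND_BACKWARD = "send_backward"
    FORWARD = "forward"
    BACKWARD = "backward"
    OPT = "optimizer"


class VPPPassSketch(PipelinePassBaseSketch):
    def job_names(self, num_model_chunks):
        # Constants are reached through self, so subclasses share one source.
        names = []
        for chunk_id in range(num_model_chunks):
            names.append(f"{self.RECV_FORWARD}{chunk_id}")
            names.append(f"{self.FORWARD}{chunk_id}")
        names.append(self.OPT)
        return names
```

Because both the 1F1B and VPP passes would inherit from the same base, a renamed job type needs to change in only one place.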
…dlePaddle#71104)
* fix communication hang issue on GPU-H(VPP)
* fix communication hang issue on GPU-H(VPP)
* fix communication hang issue on GPU-H(VPP)
* Build shared functions for reuse
* Optimize code structure
* Optimize code structure
* Consolidate the 1F1B program-splitting code into Pipeline1F1BPass
* Fix the 1F1B code
* Fix the 1F1B code
PR Category
Auto Parallel
PR Types
Bug fixes
Description
Fix the communication hang issue on GPU-H (VPP).