【Auto-Parallel | Comm】fix communication hang issue on GPU-H #70360
Conversation
Your PR was submitted successfully. Thank you for contributing to this open-source project!
    '1',
]:
    op.set_execution_stream(
        ExecutionStreamType.DefaultStream.value
Why can't the recv communication run on recv_stream?
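> Why can't the recv communication run on recv_stream?

It hangs on recv_stream. On the compute stream it executes correctly and does not hang. Since subsequent computation can only start after the recv completes, this should have no impact on performance.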
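For readers unfamiliar with the change, here is a minimal sketch of pinning recv ops to the default (compute) stream. It does not use Paddle's real classes: `Op`, the `RecvStream` enum member, the string values, and the op names are hypothetical stand-ins. Only `set_execution_stream` and `ExecutionStreamType.DefaultStream.value` appear in the diff above.

```python
from enum import Enum


class ExecutionStreamType(Enum):
    # Stand-in enum; only DefaultStream appears in the diff.
    DefaultStream = "default"
    RecvStream = "recv_stream"  # hypothetical member, for illustration only


class Op:
    """Hypothetical stand-in for a program op."""

    def __init__(self, name, stream):
        self.name = name
        self.stream = stream

    def set_execution_stream(self, stream):
        self.stream = stream


def pin_recv_to_default_stream(ops, recv_op_names=("recv",)):
    """Move every op whose name is in recv_op_names onto the default stream."""
    for op in ops:
        if op.name in recv_op_names:
            op.set_execution_stream(ExecutionStreamType.DefaultStream.value)
    return ops


ops = [
    Op("recv", ExecutionStreamType.RecvStream.value),
    Op("matmul", ExecutionStreamType.DefaultStream.value),
]
pin_recv_to_default_stream(ops)
streams = [op.stream for op in ops]
```

After the call, both ops sit on the same stream, so the compute op is ordered after the recv by stream order alone.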
recorder_op.set_str_attr("event_to_record", name)
waiter_op.set_str_array_attr("events_to_wait", [name])


def _split_program_into_forward_backward_optimize_recv_send(
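This function is very long (300+ lines of code) and has many if-else branches, yet its core flow is the same throughout: given an ordered sequence of programs (fwd -> recv_fwd -> fwd -> bwd -> send_bwd -> opt), keep a given op in one designated program, delete that op from the other programs, and establish new data dependencies for the programs that follow.
Could this operation be abstracted into a reusable common function, to lower the cost of understanding and maintaining the code later?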
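> This function is very long (300+ lines of code) and has many if-else branches, yet its core flow is the same throughout: given an ordered sequence of programs (fwd -> recv_fwd -> fwd -> bwd -> send_bwd -> opt), keep a given op in one designated program, delete that op from the other programs, and establish new data dependencies for the programs that follow. Could this operation be abstracted into a reusable common function, to lower the cost of understanding and maintaining the code later?

Thanks. I will factor this out to improve readability and reduce the maintenance cost.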
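One possible shape for the helper the reviewer asks for is sketched below. It is not the PR's implementation: programs are modeled as plain lists of op names, and `keep_op_in_one_program`, the stage keys, and the op names are all hypothetical. The function keeps an op in one stage, removes it elsewhere, and reports which later stages lost the op and therefore need a new data dependency.

```python
def keep_op_in_one_program(programs, op_name, keep_in):
    """Keep op_name only in programs[keep_in] and drop it from other stages.

    programs: dict mapping stage name -> list of op names, in execution order.
    Returns the later stages that lost the op; these are the stages that
    would need a new data dependency on the op's outputs.
    """
    stages = list(programs)
    keep_idx = stages.index(keep_in)  # raises ValueError for unknown stage
    needs_dependency = []
    for idx, stage in enumerate(stages):
        if stage == keep_in or op_name not in programs[stage]:
            continue
        programs[stage] = [op for op in programs[stage] if op != op_name]
        if idx > keep_idx:
            needs_dependency.append(stage)
    return needs_dependency


# Every stage starts as a copy of the full op list, then gets pruned.
full = ["recv", "matmul", "matmul_grad", "send", "sgd"]
programs = {
    stage: list(full)
    for stage in ["fwd", "recv_fwd", "bwd", "send_bwd", "opt"]
}
needs = keep_op_in_one_program(programs, "recv", keep_in="recv_fwd")
```

With a helper of this kind, each if-else branch in the long function could reduce to one call with a different `op_name` and `keep_in`.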
@@ -300,7 +304,9 @@ def set_skip_gc_vars(num_micro_batches, job_types, sub_programs, jobs):
     return type_to_program


-def set_pir_skip_gc_vars(num_micro_batches, job_types, sub_programs, jobs):
+def set_pir_skip_gc_vars(
+    num_micro_batches, job_types, sub_programs, jobs, type_to_skip_gc_vars={}
Why does type_to_skip_gc_vars need to be passed in from outside? Is there data-dependency information that sub_programs does not capture?
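> Why does type_to_skip_gc_vars need to be passed in from outside? Is there data-dependency information that sub_programs does not capture?

This mainly handles the following scenario: a variable sent in recv_fwd is used by both fwd and bwd, but it must not be garbage-collected during the fwd stage. sub_programs can express this, but doing so would mean analyzing the dependencies a second time. They were already analyzed when the program was split for 1F1B, and that code is fairly complex, so to avoid the duplication the result is passed in from outside.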
fixed
#70615
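The idea in the reply above (reusing skip-GC variables that were already computed during the 1F1B split) can be sketched as a per-job-type set union. This is an illustration only: `merge_skip_gc_vars` and the variable names are hypothetical, and the sketch uses `None` as the default instead of the `{}` shown in the diff, because a mutable default dict is shared across calls in Python.

```python
def merge_skip_gc_vars(computed, extra=None):
    """Union per-job-type skip-GC variable names without mutating inputs.

    computed: dict job_type -> set of names found by analyzing sub_programs.
    extra:    dict job_type -> set of names supplied by the caller, e.g.
              results of an earlier dependency analysis (may be None).
    """
    merged = {job_type: set(names) for job_type, names in computed.items()}
    for job_type, names in (extra or {}).items():
        merged.setdefault(job_type, set()).update(names)
    return merged


computed = {"forward": {"tmp_0"}, "backward": {"tmp_1"}}
extra = {"forward": {"recv_out"}}  # must survive GC in the forward stage
merged = merge_skip_gc_vars(computed, extra)
```

Copying the sets before merging keeps the caller's `computed` mapping unchanged, which matters if the same analysis result is reused for several micro-batches.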
@@ -312,7 +318,17 @@ def set_pir_skip_gc_vars(num_micro_batches, job_types, sub_programs, jobs):
# if a value is renamed by shadow_output,
# it will be used by other sub_programs
type_to_var_names[job_type].add(op.attrs()["output_name"])
if job_type in ["backward", "backward_w"]:
if os.getenv("FLAGS_enable_p2p_comm_opt", 0) in [
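Should this switch be removed outright, or turned on by default? If it is kept, I suggest renaming it. From the name FLAGS_enable_p2p_comm_opt it is hard to tell what communication optimization is actually done underneath (in fact it is not a communication optimization at all; it only changes how the ops are scheduled, and it does not improve performance).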
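> Should this switch be removed outright, or turned on by default? If it is kept, I suggest renaming it. From the name FLAGS_enable_p2p_comm_opt it is hard to tell what communication optimization is actually done underneath (in fact it is not a communication optimization at all; it only changes how the ops are scheduled, and it does not improve performance).

Thanks, agreed. We confirmed offline that it will be turned on by default.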
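A small sketch of reading such a switch with an on-by-default policy follows. The helper `env_flag_enabled` and the set of truthy spellings are assumptions (the diff only shows `'1'` in the comparison list); `os.getenv` and `os.environ` are standard library.

```python
import os

# Truthy spellings are an assumption; the diff only shows '1' in the list.
_TRUTHY = {"1", "true", "True"}


def env_flag_enabled(name, default="1"):
    """Return True if the environment variable holds a truthy string.

    The default is a string because os.getenv returns strings for variables
    that are set; mixing an int default with string comparisons is easy to
    get wrong.
    """
    return os.getenv(name, default) in _TRUTHY


FLAG = "FLAGS_enable_p2p_comm_opt"
saved = os.environ.pop(FLAG, None)      # start from an unset flag
on_when_unset = env_flag_enabled(FLAG)  # falls back to the default "1"
os.environ[FLAG] = "0"
on_when_zero = env_flag_enabled(FLAG)   # explicit opt-out
# Restore the caller's environment.
if saved is None:
    del os.environ[FLAG]
else:
    os.environ[FLAG] = saved
```

Passing `default="0"` instead would give the opposite, opt-in policy with the same helper.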
Sorry to inform you that 4bdf88b's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
48077ca to f67d7b6
f67d7b6 to a4c1a21
LGTM
c64fd32 to 0412f68
LGTM
PR Category
Auto Parallel
PR Types
Bug fixes
Description
Fix communication hang issue on GPU-H.
Depends on PR #70615
TODO: fix vpp hang
PCard-86802