[AutoParallel]:fix vpp networking error #69799
Your PR was submitted successfully. Thank you for contributing to this open-source project!
@@ -823,6 +823,76 @@ def _analyze_use_custom_mesh(ops, seg_method, pp_degree):
    return non_use_custom_mesh


def _set_skip_op_var_process_mesh(op, chunk_process_mesh):

    def get_var_attr(var_dist_attr, process_mesh=None):
Suggested change:
- def get_var_attr(var_dist_attr, process_mesh=None):
+ def get_var_attr_with_process_mesh(var_dist_attr, process_mesh):
var_dist_attr = var_array_attr[i].as_tensor_dist_attr()
if (
    process_mesh is None
    and var_dist_attr.process_mesh == chunk_process_mesh
In the cases where process_mesh=None would be passed, could you pass process_mesh=chunk_process_mesh directly instead? That would remove the need for this branch check.
done
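A minimal sketch of the simplification requested above, using plain-Python stand-ins rather than Paddle's real dist-attr classes. `FakeDistAttr` and `with_mesh` are hypothetical names introduced only for illustration. The point is that once the caller always supplies a concrete mesh, there is no `process_mesh is None` case left to branch on.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class FakeDistAttr:
    """Hypothetical stand-in for a tensor dist attr (not a Paddle class)."""

    process_mesh: Tuple[int, ...]
    dims_mapping: Tuple[int, ...]


def with_mesh(attr: FakeDistAttr, process_mesh: Tuple[int, ...]) -> FakeDistAttr:
    """Return a copy of `attr` placed on `process_mesh`.

    The mesh is a required argument, so there is no None default to check.
    """
    return replace(attr, process_mesh=process_mesh)


chunk_process_mesh = (0, 1)
attr = FakeDistAttr(process_mesh=(2, 3), dims_mapping=(-1, 0))

# The caller passes the chunk mesh explicitly instead of passing None.
updated = with_mesh(attr, chunk_process_mesh)
```

The original attribute is left untouched, which mirrors the copy-with-new-member style used in the diff.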
    and var_dist_attr.process_mesh == chunk_process_mesh
):
    var_array_attr[i] = copy_dist_attr_with_new_member(
        var_dist_attr, new_process_mesh=chunk_process_mesh
When var_dist_attr.process_mesh == chunk_process_mesh already holds, isn't this new_process_mesh argument redundant?
done
        var_dist_attr, new_process_mesh=process_mesh
    )
return var_attr
The case where var_dist_attr is None has no corresponding branch.
done
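One way the missing case could be covered is an early return. How the PR actually resolved it is not shown in this thread, so the sketch below is only an assumption: `retarget_mesh` is a hypothetical helper, and a plain dict stands in for a dist attr.

```python
from typing import Optional, Tuple


def retarget_mesh(
    var_dist_attr: Optional[dict], process_mesh: Tuple[int, ...]
) -> Optional[dict]:
    """Hypothetical helper: copy a dist attr onto a new mesh.

    Assumption: a value without distributed attributes is passed through
    unchanged instead of raising.
    """
    if var_dist_attr is None:
        # Nothing to retarget for a value that carries no dist attr.
        return None
    # Build a new dict so the caller's attr is not mutated.
    return {**var_dist_attr, "process_mesh": process_mesh}


no_attr = retarget_mesh(None, (0, 1))
moved = retarget_mesh({"process_mesh": (2, 3), "dims_mapping": [-1]}, (0, 1))
```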
op_input_vars = op.operands_source()
op_output_vars = op.results()
input_process_mesh = set_process_mesh(op_input_vars, None)
set_process_mesh(op_input_vars, input_process_mesh)
Calling set_process_mesh twice in a row here makes the logic hard to read.
done
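A possible reading of the two calls is "first discover the mesh, then apply it". Under that assumption, splitting the work into a query function and an apply function makes each call site self-explanatory. `find_input_mesh` and `apply_mesh` are hypothetical names, and meshes are modeled as plain tuples rather than Paddle objects.

```python
from typing import List, Optional, Tuple

Mesh = Tuple[int, ...]


def find_input_mesh(input_meshes: List[Optional[Mesh]]) -> Optional[Mesh]:
    """Query step: return the first mesh already set on an input, else None."""
    for mesh in input_meshes:
        if mesh is not None:
            return mesh
    return None


def apply_mesh(input_meshes: List[Optional[Mesh]], mesh: Mesh) -> List[Mesh]:
    """Apply step: return a new list in which every input uses `mesh`."""
    return [mesh for _ in input_meshes]


input_meshes = [None, (0, 1), None]
input_mesh = find_input_mesh(input_meshes)
if input_mesh is not None:
    input_meshes = apply_mesh(input_meshes, input_mesh)
```

Each function now does one thing, so a reader no longer has to infer the mode of a call from a `None` argument.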
@@ -985,6 +1080,7 @@ def complete_chunk_id(dist_program, startup_program, pipeline_strategy):

for idx in range(start_idx, end_idx):
    if ops[idx].name() in dist_skip_op_list:
        _set_skip_op_var_process_mesh(ops[idx], process_mesh)
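Much of the logic in _set_skip_op_var_process_mesh (get_var_attr, set_process_mesh, and so on) is similar to _set_process_mesh_and_chunk_id, and none of it special-cases particular op names. Could the two be merged, instead of keeping a separate process_mesh-setting path just for dist_skip_op_list? That would make later maintenance simpler.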
done
PR Category
Auto Parallel
PR Types
Bug fixes
Description
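Fixes a bug where VPP network construction fails in the following scenario:
split->build.split->reshard
1. When setting the chunk id, VPP did not handle op vars of vector type.
2. When setting the process mesh, VPP skipped dist_skip_op ops, which could leave the input and output meshes of a dist_skip_op inconsistent and cause an error when building the backward network.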
Pcard-67164
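The two cases in the description can be illustrated with a small sketch: a value may carry either one dist attr or, for a vector-type var, a list of them, and a skipped op needs the same mesh on its inputs and outputs. This is not the Paddle implementation. `set_mesh` is a hypothetical helper, and dicts and tuples stand in for dist attrs and process meshes.

```python
from typing import List, Tuple, Union

Mesh = Tuple[int, ...]
Attr = dict  # hypothetical stand-in for a tensor dist attr
VarAttr = Union[Attr, List[Attr]]  # a vector-type var carries a list of attrs


def set_mesh(var_attr: VarAttr, mesh: Mesh) -> VarAttr:
    """Return a copy of `var_attr` on `mesh`, handling vector-type vars."""
    if isinstance(var_attr, list):
        # Vector type: retarget every element, not just the container.
        return [set_mesh(a, mesh) for a in var_attr]
    return {**var_attr, "process_mesh": mesh}


chunk_mesh = (0, 1)
op_input = {"dims_mapping": [-1], "process_mesh": (2, 3)}
op_outputs = [
    {"dims_mapping": [-1], "process_mesh": (2, 3)},
    {"dims_mapping": [-1], "process_mesh": (2, 3)},
]

# Apply the same chunk mesh to inputs and outputs so they stay consistent.
new_input = set_mesh(op_input, chunk_mesh)
new_outputs = set_mesh(op_outputs, chunk_mesh)
```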