[Dist Dialect] Fix the bug of PIR infer_spmd when op inputs are inconsistent #67052

pkuzyc · 2024-08-05T10:47:42Z

PR Category

Auto Parallel

PR Types

Bug fixes

Description

Pcard-67164

The input args of the infer_meta and infer_spmd functions may differ from the operation's inputs. The original PIR InferMeta function did not handle this case. This PR fixes the bug.

For example, the StackGrad op's operation inputs are [x, out_grad], while its SPMD function's inputs are [out_grad].

The original InferSpmd part of the InferMeta function:

auto spmd_info = phi::distributed::StackGradInferSpmd(dist_meta_out_grad, axis);
DebugInfoForInferSpmd("StackGradOp", spmd_info);
// This check fails: the infer_spmd function takes only one input, so
// spmd_info.first.size() is 1, not 2.
PADDLE_ENFORCE_EQ(spmd_info.first.size(), 2u, common::errors::Unavailable(
    "Size of spmd_info.first for op[StackGradOp]is unexpected."));
// Also incorrect: the dist_attr of input x is not in spmd_info.
for (auto& arg_dist : spmd_info.first) {
    dist_operand_attrs.push_back(CvtToPirAttr(arg_dist));
}

After the fix:

auto dist_meta_out_grad = CvtToDistMetaTensor(out_grad_.type().dyn_cast<DistDenseTensorType>());
auto spmd_info = phi::distributed::StackGradInferSpmd(dist_meta_out_grad, axis);
DebugInfoForInferSpmd("StackGradOp", spmd_info);
// spmd_info.first.size() is equal to infer_spmd function's input size
PADDLE_ENFORCE_EQ(spmd_info.first.size(), 1u, common::errors::Unavailable(
    "Size of spmd_info.first for op[StackGradOp]is unexpected."));
// Get the dist_attr from operation input
dist_operand_attrs.push_back(GetTensorDistAttrArray(x));
dist_operand_attrs.push_back(CvtToPirAttr(spmd_info.first[0]));
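The rule behind the fixed snippet can be sketched in Python under a simplified model: every op input gets exactly one dist_attr, taken from the SPMD result when the SPMD function takes that input and from the attribute already attached to the input otherwise. The function name `collect_operand_attrs` and the string placeholders are hypothetical and are not Paddle APIs.

```python
def collect_operand_attrs(op_inputs, spmd_inputs, inferred, existing):
    """Return one dist_attr per op input, in op-input order.

    op_inputs:   names of the operation's inputs, e.g. ["x", "out_grad"]
    spmd_inputs: names the SPMD function actually takes, e.g. ["out_grad"]
    inferred:    attrs returned by the SPMD function, one per SPMD input
    existing:    attrs already attached to each op input, keyed by name
    """
    # The size check is against the SPMD function's inputs, not the op's.
    if len(inferred) != len(spmd_inputs):
        raise ValueError(
            f"expected {len(spmd_inputs)} inferred attrs, got {len(inferred)}"
        )
    inferred_by_name = dict(zip(spmd_inputs, inferred))
    attrs = []
    for name in op_inputs:
        if name in inferred_by_name:
            attrs.append(inferred_by_name[name])  # covered by infer_spmd
        else:
            attrs.append(existing[name])  # fall back to the input's own attr
    return attrs


# StackGrad-like case (placeholder strings stand in for real dist_attrs).
result = collect_operand_attrs(
    op_inputs=["x", "out_grad"],
    spmd_inputs=["out_grad"],
    inferred=["inferred(out_grad)"],
    existing={"x": "current(x)", "out_grad": "current(out_grad)"},
)
```

In this model the result has two entries, one per op input, even though the SPMD function only saw `out_grad`.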

paddle-bot · 2024-08-05T10:47:47Z

Your PR has been submitted. Thanks for your contribution to the open-source project!
Please wait for the CI results first. See the Paddle CI Manual for details.

JZ-LIANG · 2024-08-06T02:50:44Z

paddle/fluid/pir/dialect/op_generator/op_infermeta_func_gen.py

 """
    dist_branch_str += TEMPLATE.format(
        spmd_func=spmd_rule_func,
        args=', '.join(infer_spmd_args_list),
-        input_size=len(op_info.input_name_list),
+        input_size=spmd_input_value_num,


Why not keep the constraint that forces infer_meta and infer_spmd to have the same input signature?

The infer_spmd of stack_grad could take a redundant input to meet that constraint.

pkuzyc · Aug 6, 2024

The input args of infer_spmd are consistent with those of infer_meta. The problem is that the auto-parallel part of PIR did not handle the case where the infer_meta args (i.e. the infer_spmd args) are inconsistent with the op's inputs.
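As a rough illustration of why the expected size changes, the sketch below counts only those infer_meta parameters that are also op inputs. The helper `count_spmd_input_values` and the parameter lists are assumptions made for this example; how the PR actually computes `spmd_input_value_num` is not shown in this thread.

```python
def count_spmd_input_values(op_input_names, infer_meta_params):
    """Count the infer_meta/infer_spmd parameters that are tensor inputs of the op.

    Parameters that are not op inputs (attributes such as `axis`) are skipped.
    """
    op_inputs = set(op_input_names)
    return sum(1 for param in infer_meta_params if param in op_inputs)


# Hypothetical description of a StackGrad-like op.
stack_grad_inputs = ["x", "out_grad"]
stack_grad_infer_meta_params = ["out_grad", "axis"]

old_input_size = len(stack_grad_inputs)  # what the template checked before
new_input_size = count_spmd_input_values(
    stack_grad_inputs, stack_grad_infer_meta_params
)
```

With these inputs the old check expects two entries in `spmd_info.first`, while the count based on the infer_spmd signature expects one.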

JZ-LIANG reviewed Aug 6, 2024

winter-wang previously approved these changes Aug 6, 2024

pkuzyc added 3 commits August 6, 2024 14:00

fix the bug of PIR infer_spmd when op inputs are inconsistent

e5426b1

fix code format

63ea681

add unit test

3d09d32

pkuzyc dismissed winter-wang’s stale review via 3d09d32 August 6, 2024 06:01

pkuzyc force-pushed the fix_spmd_input_bug branch from 6500240 to 3d09d32 Compare August 6, 2024 06:01

pkuzyc marked this pull request as draft August 6, 2024 07:27

pkuzyc marked this pull request as ready for review August 6, 2024 07:27

winter-wang approved these changes Aug 7, 2024

winter-wang merged commit 9f7c60d into PaddlePaddle:develop Aug 7, 2024
31 checks passed

pkuzyc deleted the fix_spmd_input_bug branch August 12, 2024 06:35
