[AutoParallel] Generate spmd and reshard into phi grad api #57119
Conversation
Your PR has been submitted successfully. Thank you for your contribution to the open-source project!
… ap/generate_spmd_and_reshard_grad
TensorDistAttr dst_dist_attr = CopyTensorDistAttrForOutput(dist_attr);
std::vector<int64_t> dims_mapping(dist_attr.dims_mapping().size(), -1);
dst_dist_attr.set_dims_mapping(dims_mapping);
dst_dist_attr.clean_partial_status();
return dst_dist_attr;
Actually, this can be constructed directly: the TensorDistAttr constructor already creates a replicated state, so only the process_mesh needs to be set afterwards. That avoids the logic here of first copying a lot of unneeded information.
If the intent is to keep some information from the input, note that CopyTensorDistAttrForOutput already performs clean_partial, so there is no need to do it again here.
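A minimal sketch of this suggestion, assuming (as the comment notes) that a freshly constructed TensorDistAttr is already in replicated state; BuildReplicatedDistAttr is a hypothetical helper name used only for illustration:

TensorDistAttr BuildReplicatedDistAttr(const TensorDistAttr& src) {
  // Construct directly instead of copying and resetting: the new attr is
  // already replicated, so only the process_mesh is carried over.
  TensorDistAttr dst_dist_attr;
  dst_dist_attr.set_process_mesh(src.process_mesh());
  return dst_dist_attr;
}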
done, thx, removed the redundant clean_partial_status call
paddle/phi/api/lib/data_transform.cc (Outdated)
if (src_tensor->dist_attr() != dist_tensor->dist_attr()) {
  VLOG(6) << "BwdAPI KernelOut to ApiOut - "
          << ReshardDebugInfo(*src_tensor, dist_tensor->dist_attr());
  auto* func = phi::distributed::ChooseProperReshardFunction(
      *src_tensor, dist_tensor->dist_attr());
  func->Eval(dev_ctx, *src_tensor, dist_tensor->dist_attr(), dist_tensor);
} else {
  // shallow copy dense tensor
  *dist_tensor->unsafe_mutable_value() = src_tensor->value();
}
Add a TODO for Reshard: later this copy should become one of the reshard rules, so the branch check here would no longer be needed.
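A hedged sketch of what such a rule could look like (SameDistAttrReshardFunction is a hypothetical name; the Eval signature follows the call shown in the snippet above, and the exact base-class interface may differ):

class SameDistAttrReshardFunction : public ReshardFunction {
 public:
  // Matches the case where source and target dist_attr are identical.
  bool IsSuitable(const DistTensor& in,
                  const TensorDistAttr& out_dist_attr) override {
    return in.dist_attr() == out_dist_attr;
  }
  // Shallow copy of the dense tensor, replacing the else branch above.
  void Eval(DeviceContext* dev_ctx,
            const DistTensor& in,
            const TensorDistAttr& out_dist_attr,
            DistTensor* out) override {
    *out->unsafe_mutable_value() = in.value();
  }
};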
done, thx
if (out_tensor->dist_attr().is_partial()) {
  auto dist_attr = out_tensor->dist_attr();
  dist_attr.clean_partial_status();
  VLOG(6) << "FwdAPI Output P2R - "
          << ReshardDebugInfo(*out_tensor, dist_attr);
  auto* func =
      phi::distributed::ChooseProperReshardFunction(*out_tensor, dist_attr);
  func->Eval(dev_ctx, *out_tensor, dist_attr, out_tensor);
The logic here seems to have a gap: if out_tensor's dist_attr is [S(0), P], then after this conversion it can only become [S(0), R]. Is that the intended state, or should it be [R, R]?
It is the former: only the propagation of P is cut off here, all other states are preserved. The function name did not express this accurately; it has been renamed.
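A concrete illustration of this behavior, with example values chosen only for this comment:

// Suppose out_tensor is distributed as [S(0), P] on a 2-D mesh, i.e.
// dims_mapping = {0, -1} with partial status on mesh dim 1.
auto dist_attr = out_tensor->dist_attr();
dist_attr.clean_partial_status();  // clears only the partial status
// dims_mapping is still {0, -1}, so the result is [S(0), R], not [R, R].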
LGTM
LGTM
kernel_param = self.kernel['param']
if kernel_param is None:
    kernel_param = input_names + attr_names

infer_meta_params = (
    self.infer_meta['param']
    if self.infer_meta['param'] is not None
    else input_names + attr_names
)
input_decl_code = ""
self.input_args_code = ""
for param in infer_meta_params:
input_args_code = ""
for param in kernel_param:
spmd_rule is configured under infer_meta, but using kernel_param as its input parameters feels behaviorally inconsistent. Could the op's args be used directly here?
Not necessarily: only the values that participate in the subsequent dense kernel computation need sharding handling.
TODO from offline discussion: this should eventually stay consistent with the InferMeta parameter list; where the parameters differ, the InferMeta parameters should be adjusted to make them consistent. Backward ops in particular have this problem.
LGTM
…dle#57119)

* add grad spmd and reshard
* add debug info and backward gen code
* re impl matmul grad infer meta
* remove matmul infershape in runtime
* add eager gen code
* revert matmul change
* refactor matmul grad spmd rules
* polish details
* revert p2r change
* fix set dist attr error
* fix conflict
* polish details
* polish details by comments
* remove semi prefix
* add more tests for coverage
PR types
New features
PR changes
Others
Description
Pcard-73145
[AutoParallel] Generate spmd and reshard into phi grad api
This PR generates the sharding inference (spmd) and conversion (reshard) logic in the PHI API and the dynamic-graph backward execution flow. Key features:
Before the dynamic-graph backward computation runs, the dist attr of the forward inputs must be propagated to the backward outputs; otherwise the backward outputs would not know which sharding state to take. The corresponding attribute propagation was therefore added, and the DistTensor outputs of the backward phi api are created ahead of time (see the sketch after these items).
// 9. Reshard Partial Output to Replicated (Temporary) - Partial cannot currently be propagated between operators, so all Partial forward outputs are resharded to Replicated at output time.
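A rough conceptual sketch of the attribute propagation described in the first item above (not the actual generated code; CreateBackwardDistOutput and fwd_input are hypothetical names used only for illustration):

// Hedged sketch: before the backward kernel runs, the backward output
// DistTensor is created up front carrying the dist_attr of the matching
// forward input, so the desired sharding state of the gradient is known.
// CreateBackwardDistOutput / fwd_input are illustrative names only.
auto* grad_x = CreateBackwardDistOutput(fwd_input.dist_attr());
// ... run the backward phi api; its kernel output is then resharded into
// grad_x's dist_attr, as in the data_transform.cc snippet above.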
The implementation in this PR is not yet complete, because both the forward and the backward infermeta implementations of the matmul operator have issues; their correct infermeta is currently performed at kernel execution time.
The matmul issue will be handled in a follow-up PR. In this PR, cases that cannot be handled are converted to replicated for computation, so validation of the overall architecture is not blocked.