[DistDialect] add python reshard pass in pir #63362
Conversation
Your PR has been submitted successfully. Thank you for your contribution to the open-source project!
Force-pushed from 88f0cc5 to f31319f
Force-pushed from f31319f to 5d6cc33
@@ -30,10 +30,14 @@ namespace dialect {

 pir::Value shard_tensor(const pir::Value& x,
                         const phi::distributed::ProcessMesh& process_mesh,
-                        const std::vector<int64_t>& dims_mapping) {
+                        const std::vector<int64_t>& dims_mapping,
+                        const std::vector<int64_t>& partial_dims) {
Wouldn't it be better for shard_tensor's parameters to be consistent with the reshard API's? One takes dims_mapping + partial_dims while the other takes placements, which feels inconsistent.
The shard_tensor API in api.py also seems to call this API under the hood. Since this API's interface has changed, the call site in api.py needs to be adapted as well, right? I don't see api.py modified in this PR.
> The shard_tensor API in api.py also seems to call this API under the hood. Since this API's interface has changed, the call site in api.py needs to be adapted as well, right? I don't see api.py modified in this PR.
The shard_tensor parameters have been changed to match reshard, thanks!
Previously a hack and the simplest possible change were used to get the flow working quickly; in the hacked version the last parameter is optional, so the call in api.py does not need to be adapted.
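For context, here is a minimal, self-contained sketch of how a placements-style argument can be lowered to the dims_mapping + partial_dims pair that the C++ shard_tensor above expects. The Placement classes and the helper below are illustrative stand-ins, not Paddle's actual implementation.

class Replicate:      # tensor is fully replicated along this mesh dim
    pass

class Shard:          # tensor dim `dim` is split along this mesh dim
    def __init__(self, dim):
        self.dim = dim

class Partial:        # values along this mesh dim are partial and need a reduce
    pass

def placements_to_dist_attr(placements, tensor_ndim):
    # dims_mapping[i] = mesh dim that shards tensor dim i, or -1 if unsharded
    dims_mapping = [-1] * tensor_ndim
    partial_dims = []
    for mesh_dim, placement in enumerate(placements):
        if isinstance(placement, Shard):
            dims_mapping[placement.dim] = mesh_dim
        elif isinstance(placement, Partial):
            partial_dims.append(mesh_dim)
    return dims_mapping, partial_dims

# a 2-D tensor on a 2-D mesh: sharded on mesh dim 0, partial on mesh dim 1
print(placements_to_dist_attr([Shard(0), Partial()], tensor_ndim=2))  # ([0, -1], [1])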
pir::IrContext* ctx = pir::IrContext::Instance();
// support amp for shard_tensor in the future
paddle::flat_hash_map<int64_t, phi::ReduceType> partial_status;
for (size_t i = 0; i < partial_dims.size(); ++i) {
  partial_status[partial_dims[i]] = phi::ReduceType::kRedSum;
Isn't hard-coding phi::ReduceType::kRedSum here a bit hacky?
It seems fine; in distributed training the vast majority of cases use sum, and users whose reduction is not sum can set it themselves.
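A small sketch of that policy, mirroring the C++ loop above: every partial mesh dim defaults to a sum reduction, and callers with a different reduce type override individual entries afterwards. The ReduceType enum and helper are illustrative only.

from enum import Enum

class ReduceType(Enum):
    SUM = 0
    MAX = 1

def build_partial_status(partial_dims, overrides=None):
    # default every partial mesh dim to SUM, matching the hard-coded kRedSum above
    partial_status = {d: ReduceType.SUM for d in partial_dims}
    # callers whose reduction is not sum override the relevant entries themselves
    partial_status.update(overrides or {})
    return partial_status

status = build_partial_status([0, 1], overrides={1: ReduceType.MAX})
print(status)  # mesh dim 0 defaults to SUM, mesh dim 1 is overridden to MAX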
test/auto_parallel/reshard_p_to_r.py (Outdated)
    initializer=paddle.nn.initializer.Uniform(),
)

shard_tensor = paddle._pir_ops.shard_tensor(
Suggested change:
- shard_tensor = paddle._pir_ops.shard_tensor(
+ shard_tensor = paddle._C_ops.shard_tensor(
Done, thx!
Force-pushed from 008c092 to 9e547dc
op_target_dist_attr = op.attrs()[
    "op_dist_attr"
].result_dist_attr(0)
reshard_func = choose_reshard_func(
src_dist_attr and dst_dist_attr alone are not enough to translate every reshard case; extra information such as cur_rank is also needed. For example, with the same src={replicated(), mesh=[0]} and dst={replicated(), mesh=[1]},
the reshard is translated to a send if cur_rank=0 and to a recv if cur_rank=1.
Should cur_rank be passed into choose, or obtained internally from a global environment variable to stay aligned with the dynamic graph?
Please design the signature of choose_reshard_func with reuse in mind: if the dynamic-graph semi-auto-parallel reshard logic is also moved to the Python side later, how will both sides share this one interface?
> src_dist_attr and dst_dist_attr alone are not enough to translate every reshard case; extra information such as cur_rank is also needed. For example, with the same src={replicated(), mesh=[0]} and dst={replicated(), mesh=[1]}, the reshard is translated to a send if cur_rank=0 and to a recv if cur_rank=1.
> Should cur_rank be passed into choose, or obtained internally from a global environment variable to stay aligned with the dynamic graph?
Information like cur_rank is configured internally by launch for static-graph execution and is currently obtained in a way that is already unified between dynamic and static graphs, so fetching it inside reshard is more appropriate. Also, the reshard functions are subclasses of a common base, and not every reshard needs this parameter.
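A minimal sketch of the case being discussed, assuming a hypothetical replicated-to-replicated cross-mesh rule that fetches the current rank internally rather than taking it as a parameter; the class name, the dict-based dist attrs, and the environment-variable lookup are assumptions for illustration, not the PR's actual code.

import os

def get_cur_rank():
    # assumption: the launcher exports the rank via PADDLE_TRAINER_ID
    return int(os.environ.get("PADDLE_TRAINER_ID", "0"))

class RToRCrossMeshSketch:
    def is_suitable(self, src, dst):
        # both sides replicated, but living on different single-rank meshes
        return (src["placements"] == ["replicated"]
                and dst["placements"] == ["replicated"]
                and src["mesh"] != dst["mesh"])

    def reshard(self, src, dst):
        cur_rank = get_cur_rank()
        if cur_rank in src["mesh"]:
            return "send"   # the source rank emits a send op
        if cur_rank in dst["mesh"]:
            return "recv"   # the destination rank emits a recv op
        return None         # this rank does not participate in the transfer

# src={replicated(), mesh=[0]}, dst={replicated(), mesh=[1]}:
# rank 0 lowers the reshard to a send, rank 1 to a recv
rule = RToRCrossMeshSketch()
src = {"placements": ["replicated"], "mesh": [0]}
dst = {"placements": ["replicated"], "mesh": [1]}
print(rule.is_suitable(src, dst), rule.reshard(src, dst))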
> Please design the signature of choose_reshard_func with reuse in mind: if the dynamic-graph semi-auto-parallel reshard logic is also moved to the Python side later, how will both sides share this one interface?
After a preliminary check, reusing the current Python interfaces and functions should basically be fine.
    "op_dist_attr"
].result_dist_attr(0)
reshard_func = choose_reshard_func(
    op_operand_dist_attr, op_target_dist_attr
Align the naming with PIR: operand pairs with result.
    op_operand_dist_attr, op_target_dist_attr
)
reshard_func.reshard(
    new_program, op, op_operand_dist_attr, op_target_dist_attr
The arguments feel a bit redundant; op should already contain the op_operand_dist_attr and op_target_dist_attr information.
> The arguments feel a bit redundant; op should already contain the op_operand_dist_attr and op_target_dist_attr information.
program and op are static-graph-only concepts; if only op were passed, the reshard interface could not be reused uniformly across dynamic and static graphs.
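A sketch of the signature trade-off, assuming a hypothetical base class whose reshard() takes the dist attrs explicitly so that a dynamic-graph caller could pass None for the static-graph-only program/op arguments. Illustrative only, not the PR's actual class hierarchy.

class ReshardFunctionSketch:
    def is_suitable(self, src_dist_attr, dst_dist_attr):
        raise NotImplementedError

    def reshard(self, program, op, src_dist_attr, dst_dist_attr):
        # program/op carry the static-graph context (insertion point, op rewriting);
        # a dynamic-graph caller could pass program=None, op=None and work purely
        # from the dist attrs, which is why they stay explicit arguments here.
        raise NotImplementedError

class SameStatusReshardSketch(ReshardFunctionSketch):
    def is_suitable(self, src_dist_attr, dst_dist_attr):
        return src_dist_attr == dst_dist_attr

    def reshard(self, program, op, src_dist_attr, dst_dist_attr):
        # nothing to do: source and destination already agree on every rank
        return op

rule = SameStatusReshardSketch()
print(rule.is_suitable({"mesh": [0, 1]}, {"mesh": [0, 1]}))  # True in both modes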
    return True

def reshard(
    self, program, op, src_dist_attr, dst_dist_attr, remove_op=True
In what scenario is the op not removed?
> In what scenario is the op not removed?
Here we need to distinguish whether the op is a reshard op, so remove_op has been renamed to reshard_op. Reshard functions for different rules call each other in a nested way, so we must know whether the current op is a reshard op: a reshard op needs to be removed, and therefore the insertion point relative to the op (before vs. after) differs.
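A rough sketch of that control flow; the helper below only prints what the real pass would do with PIR insertion points and op erasure, and the names are hypothetical.

def run_reshard_rule(op_name, build_comm_ops, reshard_op=True):
    # stand-ins: the real pass calls paddle.pir insertion-point helpers and erases ops
    if reshard_op:
        # lowering the dist_op.reshard op itself: build the new ops in its place
        # and erase it once they are wired up
        print(f"set insertion point before {op_name}")
        build_comm_ops()
        print(f"erase {op_name}")
    else:
        # nested call from another rule on a regular op: keep that op and append
        # the communication ops right after it
        print(f"set insertion point after {op_name}")
        build_comm_ops()

run_reshard_rule("dist_op.reshard", lambda: print("  build c_allreduce_sum_"), True)
run_reshard_rule("pd_op.matmul", lambda: print("  build c_allreduce_sum_"), False)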
paddle.pir.set_insertion_point_after(op)
group = new_process_group(src_mesh.process_ids)
reduced_value = paddle._pir_ops.c_allreduce_sum_(
    op_value, group.id, False, False
The input of the allreduce should be the input of the reshard, not its output.
> The input of the allreduce should be the input of the reshard, not its output.
Here op_value depends on the op type: if the current op is a reshard op, op_value is the reshard input; otherwise it is the op's output. This check is needed because the insertion point differs depending on whether the op is a reshard op (see the explanation of the reshard_op parameter above).
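And a tiny sketch of the op_value choice described above (the dict-based op is hypothetical, for illustration only):

def pick_collective_input(op, reshard_op=True):
    # for the reshard op itself the collective consumes the reshard input;
    # for a regular op reached through a nested rule it consumes the op's output
    return op["input"] if reshard_op else op["output"]

op = {"name": "dist_op.reshard", "input": "x_partial", "output": "x_replicated"}
print(pick_collective_input(op, reshard_op=True))   # x_partial
print(pick_collective_input(op, reshard_op=False))  # x_replicated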
'builtin.parameter',
'pd_op.data',
'dist_op.shard_tensor',
'pd_op.c_allreduce_sum_',
We also need to check the dist attrs of the op and of its input/output tensors.
> We also need to check the dist attrs of the op and of its input/output tensors.
Added.
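Roughly the shape such a check could take in the unit test, written here against mocked ops so the snippet is self-contained; the real test would iterate dist_program.global_block().ops and inspect op.dist_attr as in the snippets elsewhere in this thread.

from types import SimpleNamespace

# mocked ops standing in for dist_program.global_block().ops after the pass
ops = [
    SimpleNamespace(name='builtin.parameter', dist_attr=None),
    SimpleNamespace(name='pd_op.data', dist_attr=None),
    SimpleNamespace(name='dist_op.shard_tensor', dist_attr=SimpleNamespace(mesh=[0, 1])),
    SimpleNamespace(name='pd_op.c_allreduce_sum_', dist_attr=SimpleNamespace(mesh=[0, 1])),
]

expected = ['builtin.parameter', 'pd_op.data',
            'dist_op.shard_tensor', 'pd_op.c_allreduce_sum_']
assert [op.name for op in ops] == expected

# beyond the op names, the inserted collective must also carry a dist attr
for op in ops:
    if op.name == 'pd_op.c_allreduce_sum_':
        assert op.dist_attr is not None and op.dist_attr.mesh == [0, 1]
print("checks passed")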
HIDDEN_SIZE = 8
MP_SIZE = 2

with paddle.pir_utils.IrGuard():
The unit test should build the network via dy2static (dynamic-to-static) by default.
> The unit test should build the network via dy2static (dynamic-to-static) by default.
A dy2static networking test has been added.
HIDDEN_SIZE = 8
MP_SIZE = 2

with paddle.pir_utils.IrGuard():
The unit test should build the network via dy2static (dynamic-to-static) by default.
> The unit test should build the network via dy2static (dynamic-to-static) by default.
Done
reshard_tensor = paddle._pir_ops.reshard(
    shard_tensor, self._out_mesh, [dist.Replicate()]
)
dist_program = apply_reshard_pass_v2(main_program)
The communication ops need to be given distributed attributes so that they can pass the dist2dense pass.
> The communication ops need to be given distributed attributes so that they can pass the dist2dense pass.
Added.
    return None


def register_reshard_func(reshard_func):
The classification and naming of the reshard_funcs could take torch and oneflow as references. (Design-wise, this version can first align with the dynamic-graph semi-auto-parallel implementation; a new design can be adopted later when dynamic and static graphs are unified.)
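For readers unfamiliar with the pattern, a self-contained sketch of a reshard-function registry and dispatcher in the spirit of register_reshard_func / choose_reshard_func; the concrete rule class and the string-based dist attrs are hypothetical.

_g_reshard_funcs = []

def register_reshard_func(reshard_func):
    _g_reshard_funcs.append(reshard_func)

def choose_reshard_func(src_dist_attr, dst_dist_attr):
    # return the first registered rule that can handle this (src, dst) pair
    for func in _g_reshard_funcs:
        if func.is_suitable(src_dist_attr, dst_dist_attr):
            return func
    return None

class PToRReshardSketch:
    """Hypothetical rule: partial -> replicated via an allreduce."""
    def is_suitable(self, src, dst):
        return src == "partial" and dst == "replicated"

    def reshard(self, src, dst):
        return "insert c_allreduce_sum_"

register_reshard_func(PToRReshardSketch())
print(choose_reshard_func("partial", "replicated").reshard("partial", "replicated"))
print(choose_reshard_func("shard", "replicated"))  # None: no rule registered yet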
LGTM
LGTM
for op in program.global_block().ops:
    if op.name() == "pd_op.send_v2":
        op.dist_attr = (
            paddle.base.libpaddle.pir.create_op_dist_attribute(
                src_mesh, [src_dist_attr], []
            )
        )
    elif op.name() == "pd_op.recv_v2":
        op.dist_attr = (
            paddle.base.libpaddle.pir.create_op_dist_attribute(
                dst_mesh, [], [dst_dist_attr]
            )
        )

return recv_value.get_defining_op(), dst_dist_attr
why do this?
> why do this?
here dst_dist_attr is actually not necessary, will fix in next pr, thx!
LGTM
PR Category: Auto Parallel
PR Types: New features
Description: Pcard-67164