【AutoParallism】Support semi auto amp #61221
Conversation
@@ -44,9 +44,9 @@ if((WITH_GPU) AND (LINUX))
endif()
if((WITH_GPU) AND (LINUX))
  py_test_modules(
    test_semi_auto_parallel_llama_model_vpp MODULES
why delete vpp
The unit test should be added to test/auto_parallel/hybrid_strategy/testslist.csv, and then the script should be run to generate the corresponding file. The original vpp unit test was added manually, so it does not run in CI, and Zhonghui reported that it has problems. It will be added back after it is fixed later.
@@ -56,7 +56,7 @@ def check_results(
        # the number of operators of this type +2
        self.assertEqual(
            int(op_list['transfer_dtype'].split(',')[0]),
            total_steps + total_steps * 2 + 2,
why change here
After changing phi::AddKernel<T, CONTEXT>(*cpu_ctx, src_tensor, *dst_tensor, dst_tensor); in gradient_accumulator.cc to phi::AddKernel<T, CONTEXT>(*cpu_ctx, *dst_tensor, src_tensor, dst_tensor);, there are fewer casts in master_grad. Confirmed with Zhang Ting that this change is correct.
scaler (paddle.amp.GradScaler): The GradScaler to be sharded.

Returns:
    An GradScaler with distributed view.
An -> a
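A minimal usage sketch based on this docstring; it assumes the new interface is exposed as paddle.distributed.shard_scaler and simply wraps an ordinary paddle.amp.GradScaler (an illustration, not the PR's test code):

```python
import paddle
import paddle.distributed as dist

# Build a normal dygraph GradScaler first.
scaler = paddle.amp.GradScaler(init_loss_scaling=2.0**16)

# Wrap it so that unscaling and the found_inf check are performed with a
# distributed view (assumed entry point: dist.shard_scaler, added in this PR).
scaler = dist.shard_scaler(scaler)
```

After wrapping, the usual scale / step / update calls are expected to follow the standard GradScaler workflow.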
from paddle.base import unique_name
from paddle.base.dygraph import to_variable
use to_tensor
done
// initialize output dist_attr's process_mesh, batch_dim and dynamic dims with
// input dist_attr.
TensorDistAttr out_dist_attr =
    CopyTensorDistAttrWithPartialForOutput(x_dist_attr_src);
You can reuse the ElementwiseUnary function except for this line.
ElementwiseUnary clears the partial status, but cast can propagate the partial status onward.
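A conceptual sketch of the behavior discussed in this thread: a tensor that is partial across the mesh stays partial after a dtype cast, so the reduction can be deferred (for example until after grads are cast to float32 for master_grad). This is only an illustration under assumptions: it uses the public semi-auto Python API (ProcessMesh, shard_tensor, Shard, placements), needs GPUs, and would have to be launched with two ranks via paddle.distributed.launch; it is not the unit test from this PR.

```python
import paddle
import paddle.distributed as dist

mesh = dist.ProcessMesh([0, 1], dim_names=['mp'])

# Shard the contracted axis of a matmul across the mesh; the product is then
# a partial (pending-allreduce) tensor.
x = dist.shard_tensor(paddle.ones([4, 8], dtype='float16'), mesh, [dist.Shard(1)])
w = dist.shard_tensor(paddle.ones([8, 8], dtype='float16'), mesh, [dist.Shard(0)])
y = paddle.matmul(x, w)  # placements expected to contain a Partial entry

# With the cast SPMD rule added in this PR, the cast keeps the partial status
# instead of clearing it (as ElementwiseUnary would), so the allreduce can be
# performed once, after the dtype change.
y_fp32 = paddle.cast(y, 'float32')
print(y_fp32.placements)
```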
if len(amp_global_state().mesh2params):
    for _, params in amp_global_state().mesh2params.items():
        core.eager.set_master_grads(params)
plz comment why
done
temp_found_inf = dist.reshard(
    temp_found_inf, src_mesh, temp_found_inf.placements
)
plz comment why
done
LGTM for the unit test removal.
LGTM
LGTM for shard_scaler docs.
Please also provide the Chinese documentation~
LGTM for API changes
PR types: New features
PR changes: Others
Description
Pcard-76459
Support the AMP strategy under dynamic semi-auto parallelism. The main changes in this PR to support semi-auto AMP are as follows (a usage sketch is given after the list):
- Add the shard_scaler interface, which serves as the scaler entry point for semi-auto parallelism
- Adapt the master_grad logic involved in auto_cast to support semi-auto parallelism
- Fix the logic error in TensorAdd so that the "half + float32" case is handled correctly
- Add an SPMD rule for cast so that it can propagate the partial status
- Add the __hash__ member to make it convenient to collect grads under different PP meshes
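A hedged end-to-end sketch of how the pieces above could fit together in one semi-auto AMP training step. The mesh, model, and optimizer are illustrative placeholders; shard_scaler / shard_optimizer / shard_tensor follow the public paddle.distributed API, and master_grad is passed through paddle.amp.decorate as in the existing dygraph AMP interface, so treat this as a sketch rather than the PR's own test:

```python
import paddle
import paddle.distributed as dist

mesh = dist.ProcessMesh([0, 1], dim_names=['dp'])  # illustrative two-card mesh

model = paddle.nn.Linear(8, 8)
opt = paddle.optimizer.AdamW(parameters=model.parameters())

# O2 AMP with float32 master grads; the master_grad path is what this PR
# adapts for semi-auto parallelism.
model, opt = paddle.amp.decorate(
    models=model, optimizers=opt, level='O2', master_grad=True
)

opt = dist.shard_optimizer(opt)  # semi-auto view of the optimizer
scaler = dist.shard_scaler(paddle.amp.GradScaler(init_loss_scaling=2.0**16))

x = dist.shard_tensor(paddle.rand([4, 8]), mesh, [dist.Shard(0)])
with paddle.amp.auto_cast(level='O2'):
    loss = model(x).mean()
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
opt.clear_grad()
```

Like any semi-auto example, this would be started with python -m paddle.distributed.launch across the ranks in the mesh.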