[Auto Parallel] fix save load state_dict #66266
Conversation
Your PR has been submitted successfully. Thank you for contributing to the open source project!
@@ -437,6 +439,15 @@ def load_state_dict(
rank_to_files, missing_keys = get_rank_to_files(
    path, flat_state_dict, process_group, use_dist
)

gloabl_rank_to_files = []
get_rank_to_files already derives its result from global_data_files; do we still need all_gather_object(gloabl_rank_to_files) here?
all_gather is not needed here, but get_rank_to_files needs some changes. The state_dict differs between dynamic and static semi-auto parallel: under static semi-auto, the state_dict is partial, so the derived necessary_files is also partial, and therefore an all_gather must be performed on necessary_files.
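A minimal sketch of the merge described above: each rank holds only a partial necessary_files list, and the union of all ranks' lists gives the global set. The gather is simulated in-process here; the real code would use paddle.distributed.all_gather_object, and the file names are hypothetical.

```python
# Hedged sketch: under static semi-auto parallel each rank's state_dict
# is partial, so the necessary_files it derives are partial too. After
# an all_gather of those lists, every rank takes the union.

def merge_necessary_files(partial_lists):
    """Union per-rank partial file lists into one sorted global list."""
    merged = set()
    for files in partial_lists:  # one entry per rank after the gather
        merged.update(files)
    return sorted(merged)

# Simulated per-rank results (hypothetical checkpoint file names).
rank0_files = ["0_0.distcp"]
rank1_files = ["1_0.distcp"]
print(merge_necessary_files([rank0_files, rank1_files]))
```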
LGTM
@@ -549,9 +566,11 @@ def load_state_dict(
    storage_chunk_tensor, src=src_rank, group=process_group
)
else:
    tmp_tensor = paddle.assign(cur_chunk_tensor)
Why use tmp_tensor? Please add a comment.
The memory held by cur_chunk_tensor may be non-contiguous, and the broadcast API does not support this type of tensor.
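A hedged illustration of the contiguity issue, using NumPy rather than Paddle: a strided view such as a column slice is non-contiguous in memory, and collectives that expect a flat buffer need a contiguous copy first. Here np.ascontiguousarray plays the role that paddle.assign plays in the diff above.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
col = a[:, 1]                     # strided view: non-contiguous memory

# A copy laid out contiguously is safe to hand to a broadcast-style API.
tmp = np.ascontiguousarray(col)

print(col.flags["C_CONTIGUOUS"])  # False
print(tmp.flags["C_CONTIGUOUS"])  # True
```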
)

if src_rank == item.rank:
    # assign value locally
    paddle.assign(storage_chunk_tensor, cur_chunk_tensor)
if src_rank == paddle.distributed.get_rank():
What about the else branch?
The condition src_rank == item.rank will be satisfied by all ranks, but only one rank needs to perform the assignment operation. Additionally, when src_rank != paddle.distributed.get_rank(), storage_chunk_tensor may be None.
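A minimal sketch of the guard being discussed: every rank sees src_rank == item.rank, but only the source rank itself holds storage_chunk_tensor, so the local assignment must also check the current rank. Ranks are simulated here; the real code uses paddle.distributed.get_rank(), and the helper name is hypothetical.

```python
def maybe_assign_locally(src_rank, item_rank, cur_rank, storage_chunk):
    """Return the locally assigned chunk, or None if this rank
    should wait for the broadcast instead."""
    if src_rank == item_rank and src_rank == cur_rank:
        # Only the source rank assigns; on other ranks
        # storage_chunk may be None.
        return storage_chunk
    return None

# Source rank holds data; a non-source rank holds None.
print(maybe_assign_locally(0, 0, 0, [1.0, 2.0]))  # assigns locally
print(maybe_assign_locally(0, 0, 1, None))        # defers to broadcast
```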
We’ll add comments in the next PR.
@@ -24,6 +24,7 @@ class LocalTensorMetadata:

global_offset: Tuple[int]
local_shape: Tuple[int]
dtype: str
Why do we need dtype?
In static mode, the state_dict is incomplete, so the previous code would trigger a KeyError. Here, the dtype is stored in advance.
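A hedged sketch of the metadata change above: recording dtype alongside the shard layout lets a rank with only a partial state_dict know each tensor's dtype without looking the key up, avoiding the KeyError. The field names mirror the diff; the rest of the class body is assumed.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class LocalTensorMetadata:
    global_offset: Tuple[int, ...]
    local_shape: Tuple[int, ...]
    dtype: str  # stored in advance for static-mode partial state_dicts

meta = LocalTensorMetadata(
    global_offset=(0, 0), local_shape=(2, 4), dtype="float32"
)
print(meta.dtype)  # available without touching the (partial) state_dict
```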
PR Category
Auto Parallel
PR Types
Others
Description
Fix save/load of state_dict.