[FSDP2] better error msg for cpu offloading #135156
Conversation
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135156
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 5a6a402 with merge base 356f14e.
FLAKY - The following job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…ading" `pytest -s distributed/_composable/fsdp/test_fully_shard_training.py -k test_to_float64_after_init` resolve cpu offload error in TorchTune: pytorch/torchtune#1412 cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…cal tensors" `pytest -s distributed/_composable/fsdp/test_fully_shard_training.py -k test_to_float64_after_init` resolve cpu offload error in TorchTune: pytorch/torchtune#1412 cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
Synced; I will modify the PR to throw an error for a GPU state dict instead of implicitly moving the GPU state dict to CPU.
…cal tensors" resolve cpu offload error in TorchTune: pytorch/torchtune#1412 this PR constructs DTensor from cpu offloaded local tensor `pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload` cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
When CPU offloading is enabled, if the user loads a GPU state dict, FSDP2 throws a non-obvious error at backward:

```
RuntimeError: attempting to assign a gradient with device type 'cpu' to a tensor with device type 'cuda'. Please ensure that the gradient and the tensor are on the same device
```

This PR throws the error more explicitly by specifying which parameters need to be moved because of CPU offloading:

```
FSDP parameters should be materialized on cpu when enabling cpu offloading. For example, load cpu state dict or call module.to_empty(device="cpu"). Found following parameters on non-cpu device: {param_names_not_on_cpu}
```

`pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload`

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
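For readers hitting this, here is a minimal standalone sketch of the kind of check this PR adds. The names are hypothetical; the actual check lives inside FSDP2 and iterates its sharded parameters rather than a plain module.

```
# Hypothetical standalone version of the validation added by this PR:
# collect the names of parameters that are not on CPU and report them.
import torch.nn as nn


def validate_cpu_offload_params(module: nn.Module) -> None:
    not_on_cpu = [
        name for name, p in module.named_parameters() if p.device.type != "cpu"
    ]
    if not_on_cpu:
        raise RuntimeError(
            "FSDP parameters should be materialized on cpu when enabling cpu offloading. "
            'For example, load cpu state dict or call module.to_empty(device="cpu"). '
            f"Found following parameters on non-cpu device: {not_on_cpu}"
        )


model = nn.Sequential(nn.Linear(4, 4))
validate_cpu_offload_params(model)  # passes: parameters are on CPU by default
```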
@awgu I repurposed the PR to throw an error message when loading a GPU state dict. Ready for review.
just one suggestion for including the device in the error message for _validate_cpu_offload_params
```
]
if param_names_not_on_cpu:
    raise RuntimeError(
        "FSDP parameters should be materialized on cpu when enabling cpu offloading. "
```
nit: I think we can capitalize CPU

```
- "FSDP parameters should be materialized on cpu when enabling cpu offloading. "
+ "FSDP parameters should be materialized on CPU when enabling CPU offloading. "
```
```
if param_names_not_on_cpu:
    raise RuntimeError(
        "FSDP parameters should be materialized on cpu when enabling cpu offloading. "
        'For example, load cpu state dict or call module.to_empty(device="cpu"). '
```
```
- 'For example, load cpu state dict or call module.to_empty(device="cpu"). '
+ 'For example, load a CPU state dict or call module.to_empty(device="cpu"). '
```
```
param_names_not_on_cpu = [
    fsdp_param._param_fqn
    for fsdp_param in self.fsdp_params
    if fsdp_param.sharded_param.device.type != "cpu"
]
```
This is just a suggestion related to below. Specifically, I think it would be helpful to include the device in the error message in this case.
```
- param_names_not_on_cpu = [
-     fsdp_param._param_fqn
-     for fsdp_param in self.fsdp_params
-     if fsdp_param.sharded_param.device.type != "cpu"
- ]
+ fsdp_params_not_on_cpu = [
+     fsdp_param
+     for fsdp_param in self.fsdp_params
+     if fsdp_param.sharded_param.device.type != "cpu"
+ ]
```
```
raise RuntimeError(
    "FSDP parameters should be materialized on cpu when enabling cpu offloading. "
    'For example, load cpu state dict or call module.to_empty(device="cpu"). '
    f"Found following parameters on non-cpu device: {param_names_not_on_cpu}\n"
```
part 2 of the suggestion of including the device in the error message
(needs some formatting)

```
- f"Found following parameters on non-cpu device: {param_names_not_on_cpu}\n"
+ f"Found following parameters on non-cpu device: {[(fsdp_param._param_fqn, fsdp_param.sharded_param.device) for fsdp_param in fsdp_params_not_on_cpu]}\n"
```
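Putting the two suggestions together, the check could look roughly like this. This is a sketch only: the real method iterates `self.fsdp_params` on the param group, `_param_fqn` and `sharded_param` are taken from the diff above, and the exact message formatting is left to the author.

```
# Sketch combining both suggestions: keep the offending FSDPParam objects and
# report each parameter's FQN together with the device it was found on.
def _validate_cpu_offload_params(fsdp_params) -> None:
    fsdp_params_not_on_cpu = [
        fsdp_param
        for fsdp_param in fsdp_params
        if fsdp_param.sharded_param.device.type != "cpu"
    ]
    if fsdp_params_not_on_cpu:
        offenders = [
            (fsdp_param._param_fqn, str(fsdp_param.sharded_param.device))
            for fsdp_param in fsdp_params_not_on_cpu
        ]
        raise RuntimeError(
            "FSDP parameters should be materialized on CPU when enabling CPU offloading. "
            'For example, load a CPU state dict or call module.to_empty(device="cpu"). '
            f"Found following parameters on non-CPU device: {offenders}\n"
        )
```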
When CPU offloading is enabled, if the user loads a GPU state dict, FSDP2 throws a non-obvious error at backward:

```
RuntimeError: attempting to assign a gradient with device type 'cpu' to a tensor with device type 'cuda'. Please ensure that the gradient and the tensor are on the same device
```

This PR throws the error more explicitly by specifying which parameters need to be moved because of CPU offloading:

```
FSDP parameters should be materialized on cpu when enabling cpu offloading. For example, load cpu state dict or call module.to_empty(device="cpu"). Found following parameters on non-cpu device: ['0.weight']
```

`pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload`

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; first few of them are: trunk / macos-py3-arm64 / build. Details for Dev Infra team: raised by workflow job.
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
When CPU offloading is enabled, if the user loads a GPU state dict, FSDP2 throws a non-obvious error at backward:

```
RuntimeError: attempting to assign a gradient with device type 'cpu' to a tensor with device type 'cuda'. Please ensure that the gradient and the tensor are on the same device
```

This PR throws the error more explicitly by specifying which parameters need to be moved because of CPU offloading:

```
FSDP parameters should be materialized on cpu when enabling cpu offloading. For example, load cpu state dict or call module.to_empty(device="cpu"). Found following parameters on non-cpu device: ['0.weight']
```

`pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload`

Pull Request resolved: pytorch#135156
Approved by: https://github.com/awgu
Stack from ghstack (oldest at bottom):
When CPU offloading is enabled, if the user loads a GPU state dict, FSDP2 throws a non-obvious error at backward (the RuntimeError about assigning a CPU gradient to a CUDA tensor quoted above).
This PR throws the error more explicitly by specifying which parameters should be moved because of CPU offloading.
`pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload`
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
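For completeness, here is a hedged sketch of the setup the new error message points users toward. It assumes the FSDP2 `fully_shard` / `CPUOffloadPolicy` API from `torch.distributed._composable.fsdp` and an already-initialized process group, so it is not runnable as a standalone script.

```
# Sketch only: materialize FSDP2 parameters on CPU when CPU offloading is
# enabled, as the new error message suggests. Assumes
# torch.distributed.init_process_group(...) has already been called.
import torch
import torch.nn as nn
from torch.distributed._composable.fsdp import CPUOffloadPolicy, fully_shard

with torch.device("meta"):
    model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))

fully_shard(model, offload_policy=CPUOffloadPolicy())

# Materialize sharded parameters on CPU (not CUDA) before training, e.g. via
# to_empty(device="cpu"), and/or load a CPU state dict rather than a CUDA one.
model.to_empty(device="cpu")
```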