[FSDP2] better error msg for cpu offloading #135156
Conversation
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135156
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 5a6a402 with merge base 356f14e.
FLAKY - The following job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…ading" `pytest -s distributed/_composable/fsdp/test_fully_shard_training.py -k test_to_float64_after_init` resolve cpu offload error in TorchTune: pytorch/torchtune#1412 cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…cal tensors" `pytest -s distributed/_composable/fsdp/test_fully_shard_training.py -k test_to_float64_after_init` resolve cpu offload error in TorchTune: pytorch/torchtune#1412 cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
Synced; I will modify the PR to throw an error for a GPU state dict instead of implicitly moving the GPU state dict to CPU.
…cal tensors" resolve cpu offload error in TorchTune: pytorch/torchtune#1412 this PR constructs DTensor from cpu offloaded local tensor `pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload` cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
When CPU offloading is enabled, if the user loads a GPU state dict, FSDP2 throws a non-obvious error at backward:

```
RuntimeError: attempting to assign a gradient with device type 'cpu' to a tensor with device type 'cuda'. Please ensure that the gradient and the tensor are on the same device
```

This PR throws the error more explicitly by specifying which parameters need to be moved because of CPU offloading:

```
FSDP parameters should be materialized on cpu when enabling cpu offloading. For example, load cpu state dict or call module.to_empty(device="cpu"). Found following parameters on non-cpu device: {param_names_not_on_cpu}
```

`pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload`

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
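For readers hitting this, here is a minimal standalone sketch of the kind of check this PR adds. The names are hypothetical; the actual check lives inside FSDP2 and iterates its sharded parameters rather than a plain module.

```
# Hypothetical standalone version of the validation added by this PR:
# collect the names of parameters that are not on CPU and report them.
import torch.nn as nn


def validate_cpu_offload_params(module: nn.Module) -> None:
    not_on_cpu = [
        name for name, p in module.named_parameters() if p.device.type != "cpu"
    ]
    if not_on_cpu:
        raise RuntimeError(
            "FSDP parameters should be materialized on cpu when enabling cpu offloading. "
            'For example, load cpu state dict or call module.to_empty(device="cpu"). '
            f"Found following parameters on non-cpu device: {not_on_cpu}"
        )


model = nn.Sequential(nn.Linear(4, 4))
validate_cpu_offload_params(model)  # passes: parameters are on CPU by default
```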
@awgu I repurposed the PR to throw an error message when loading a GPU state dict. Ready for review.
just one suggestion for including the device in the error message for _validate_cpu_offload_params
```
]
if param_names_not_on_cpu:
    raise RuntimeError(
        "FSDP parameters should be materialized on cpu when enabling cpu offloading. "
```
nit: I think we can capitalize CPU

```
- "FSDP parameters should be materialized on cpu when enabling cpu offloading. "
+ "FSDP parameters should be materialized on CPU when enabling CPU offloading. "
```
```
if param_names_not_on_cpu:
    raise RuntimeError(
        "FSDP parameters should be materialized on cpu when enabling cpu offloading. "
        'For example, load cpu state dict or call module.to_empty(device="cpu"). '
```
```
- 'For example, load cpu state dict or call module.to_empty(device="cpu"). '
+ 'For example, load a CPU state dict or call module.to_empty(device="cpu"). '
```
```
param_names_not_on_cpu = [
    fsdp_param._param_fqn
    for fsdp_param in self.fsdp_params
    if fsdp_param.sharded_param.device.type != "cpu"
]
```
This is just a suggestion related to below. Specifically, I think it would be helpful to include the device in the error message in this case.
```
- param_names_not_on_cpu = [
-     fsdp_param._param_fqn
-     for fsdp_param in self.fsdp_params
-     if fsdp_param.sharded_param.device.type != "cpu"
- ]
+ fsdp_params_not_on_cpu = [
+     fsdp_param
+     for fsdp_param in self.fsdp_params
+     if fsdp_param.sharded_param.device.type != "cpu"
+ ]
```
```
raise RuntimeError(
    "FSDP parameters should be materialized on cpu when enabling cpu offloading. "
    'For example, load cpu state dict or call module.to_empty(device="cpu"). '
    f"Found following parameters on non-cpu device: {param_names_not_on_cpu}\n"
```
part 2 of the suggestion of including the device in the error message
(needs some formatting)

```
- f"Found following parameters on non-cpu device: {param_names_not_on_cpu}\n"
+ f"Found following parameters on non-cpu device: {[(fsdp_param._param_fqn, fsdp_param.sharded_param.device) for fsdp_param in fsdp_params_not_on_cpu]}\n"
```
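Putting the two suggestions together, the check could look roughly like this. This is a sketch only: the real method iterates `self.fsdp_params` on the param group, `_param_fqn` and `sharded_param` are taken from the diff above, and the exact message formatting is left to the author.

```
# Sketch combining both suggestions: keep the offending FSDPParam objects and
# report each parameter's FQN together with the device it was found on.
def _validate_cpu_offload_params(fsdp_params) -> None:
    fsdp_params_not_on_cpu = [
        fsdp_param
        for fsdp_param in fsdp_params
        if fsdp_param.sharded_param.device.type != "cpu"
    ]
    if fsdp_params_not_on_cpu:
        offenders = [
            (fsdp_param._param_fqn, str(fsdp_param.sharded_param.device))
            for fsdp_param in fsdp_params_not_on_cpu
        ]
        raise RuntimeError(
            "FSDP parameters should be materialized on CPU when enabling CPU offloading. "
            'For example, load a CPU state dict or call module.to_empty(device="cpu"). '
            f"Found following parameters on non-CPU device: {offenders}\n"
        )
```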
When CPU offloading is enabled, if the user loads a GPU state dict, FSDP2 throws a non-obvious error at backward:

```
RuntimeError: attempting to assign a gradient with device type 'cpu' to a tensor with device type 'cuda'. Please ensure that the gradient and the tensor are on the same device
```

This PR throws the error more explicitly by specifying which parameters need to be moved because of CPU offloading:

```
FSDP parameters should be materialized on cpu when enabling cpu offloading. For example, load cpu state dict or call module.to_empty(device="cpu"). Found following parameters on non-cpu device: ['0.weight']
```

`pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload`

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; first few of them are: trunk / macos-py3-arm64 / build. Details for Dev Infra team: raised by workflow job.
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
When CPU offloading is enabled, if the user loads a GPU state dict, FSDP2 throws a non-obvious error at backward:

```
RuntimeError: attempting to assign a gradient with device type 'cpu' to a tensor with device type 'cuda'. Please ensure that the gradient and the tensor are on the same device
```

This PR throws the error more explicitly by specifying which parameters need to be moved because of CPU offloading:

```
FSDP parameters should be materialized on cpu when enabling cpu offloading. For example, load cpu state dict or call module.to_empty(device="cpu"). Found following parameters on non-cpu device: ['0.weight']
```

`pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload`

Pull Request resolved: pytorch#135156
Approved by: https://github.com/awgu
Stack from ghstack (oldest at bottom):
When CPU offloading is enabled, if the user loads a GPU state dict, FSDP2 throws a non-obvious error at backward (the RuntimeError about assigning a CPU gradient to a CUDA tensor quoted above).
This PR throws the error more explicitly by specifying which parameters should be moved because of CPU offloading.
`pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload`
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
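For completeness, here is a hedged sketch of the setup the new error message points users toward. It assumes the FSDP2 `fully_shard` / `CPUOffloadPolicy` API from `torch.distributed._composable.fsdp` and an already-initialized process group, so it is not runnable as a standalone script.

```
# Sketch only: materialize FSDP2 parameters on CPU when CPU offloading is
# enabled, as the new error message suggests. Assumes
# torch.distributed.init_process_group(...) has already been called.
import torch
import torch.nn as nn
from torch.distributed._composable.fsdp import CPUOffloadPolicy, fully_shard

with torch.device("meta"):
    model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))

fully_shard(model, offload_policy=CPUOffloadPolicy())

# Materialize sharded parameters on CPU (not CUDA) before training, e.g. via
# to_empty(device="cpu"), and/or load a CPU state dict rather than a CUDA one.
model.to_empty(device="cpu")
```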