# [Distributed] Improve efficiency of NaN checker #135414
## Conversation
🔗 Helpful Links 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135414
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit ea492a5 with merge base 5f57be7. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Base tensor is guaranteed to have 16-byte alignment, but a view into it does not have to be 🤔
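For illustration, a minimal sketch of the check this alignment concern implies before using 16-byte vectorized loads; the helper name is hypothetical, not the PR's code.

```cuda
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many leading elements would have to be checked
// one-by-one before a view's data pointer reaches a 16-byte boundary and
// vectorized 16-byte loads become safe. Assumes sizeof(T) divides 16 and the
// pointer is at least sizeof(T)-aligned.
template <typename T>
__host__ __device__ size_t misalignedPrefixElems(const T* ptr) {
  uintptr_t addr = reinterpret_cast<uintptr_t>(ptr);
  size_t rem = addr % 16;  // 0 means already 16-byte aligned
  return rem == 0 ? 0 : (16 - rem) / sizeof(T);
}
```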
I think, given the increased complexity of the kernel, it would be good to add a test that more carefully checks for cases where the NaN detector misses a NaN.
Given we can't realistically afford a test that loops through the indices of a large tensor and sets each value to NaN exhaustively, do you think it makes sense to do some combination of (a) exhaustive testing on a small-to-medium tensor that is still large enough to exercise both the unrolled and suffix loops, and (b) a test that sets a random index to NaN, so that across many repetitions of CI we could expect a 'flaky' signal if we are missing certain values?
Re (a): yes, I can add a test that shmoos through small-to-medium sizes (and data types). Re (b): yep, I think the existing tests can be modified to support the randomness.
@wconstab Test modified to cover a wider size range.
One more thing (could be a separate PR): we are still missing fp8, IIUC. We should definitely cover this. If it's convenient to add in this PR, it might make sense as yet one more case of the kernel template.
@wconstab Can it be in a separate PR, for cleanliness reasons?
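For context on what fp8 coverage might look like, a rough byte-level sketch; the masks reflect the usual float8_e4m3fn / float8_e5m2 encodings and are an assumption for illustration, not code from this PR.

```cuda
#include <cstdint>

// float8_e4m3fn has no infinities; its only NaN bit patterns are 0x7F and
// 0xFF, i.e. all non-sign bits set.
__host__ __device__ inline bool isNaN_e4m3fn(uint8_t b) {
  return (b & 0x7F) == 0x7F;
}

// float8_e5m2 follows IEEE conventions: exponent all ones (mask 0x7C) with a
// non-zero mantissa (mask 0x03) is NaN.
__host__ __device__ inline bool isNaN_e5m2(uint8_t b) {
  return (b & 0x7C) == 0x7C && (b & 0x03) != 0;
}
```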
test/distributed/test_c10d_nccl.py
```python
# randomly pick an nan element
i = random.randint(0, nan_tensor.size(0) - 1)
j = random.randint(0, nan_tensor.size(1) - 1)
nan_tensor[i, j] = float("nan")
index = (i,) * len(size)
```
nit: this appears to only put NaN values on the (i, i) diagonal. What about something like this?
`index = tuple([randint(...) for _ in range(len(size))])`
Thanks, adopted.
```cuda
// EltPerPack would be greater than 8 if falling in this case.

template <typename T, int EltPerPack>
struct CheckBytePack {
```
IIUC this generalized kernel would only be used for float8? I guess in a later PR you would possibly replace it with a specialized one too?
Yes
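For reference, a minimal sketch of what the generic element-wise fallback could look like; the `BytePack` layout, the device-side assert, and the float conversion are illustrative assumptions, not the PR's exact code.

```cuda
#include <cassert>
#include <cmath>

struct BytePack { unsigned long long u64[2]; };  // 16 bytes

// Generic fallback: reinterpret the 16-byte pack as EltPerPack elements of T
// and test each one.
template <typename T, int EltPerPack>
struct CheckBytePack {
  static __device__ void check(BytePack* tmp) {
    const T* elts = reinterpret_cast<const T*>(tmp);
#pragma unroll
    for (int i = 0; i < EltPerPack; i++) {
      // Assumes T is convertible to float; the real error path may differ.
      assert(!isnan(static_cast<float>(elts[i])));
    }
  }
};
```

A float8 specialization would presumably drop the per-element conversion and test bit patterns of the packed words instead.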
```cuda
int nWorkers = blockDim.x * gridDim.x;
// First load values from global memory into tmp buffer
#pragma unroll 8
for (int j = 0; j < UNROLL; j++) {
```
Hm, does `checkChunk` get called with a different ptr offset for each thread?
Below at line 134, `checkChunk` is called like this: `checkChunk<T>(ptr + offset);` where `offset` accounts for the different offsets of different threads.
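To make the data flow concrete, here is a sketch of how the chunk check and its caller could fit together, based on the fragments quoted in this thread (UNROLL = 8, per-thread interleaved offsets); the names and exact indexing are assumptions, not the PR's source.

```cuda
#include <cstddef>

struct BytePack { unsigned long long u64[2]; };  // 16 bytes

// Assumed to be the per-pack check discussed above.
template <typename T, int EltPerPack>
struct CheckBytePack { static __device__ void check(BytePack* tmp); };

constexpr int UNROLL = 8;

// Each call checks UNROLL packs for one thread, interleaved with the other
// threads: pack j sits nWorkers packs after pack j-1. Loads are issued first
// into a local buffer, then checked, so the loads can be in flight together.
template <typename T>
__device__ void checkChunk(const BytePack* ptr) {
  size_t nWorkers = (size_t)blockDim.x * gridDim.x;
  BytePack tmp[UNROLL];
#pragma unroll 8
  for (int j = 0; j < UNROLL; j++) {  // first loop: loads only
    tmp[j] = ptr[j * nWorkers];
  }
#pragma unroll 8
  for (int j = 0; j < UNROLL; j++) {  // second loop: checks only
    CheckBytePack<T, sizeof(BytePack) / sizeof(T)>::check(&tmp[j]);
  }
}

// Caller: every thread starts at its own pack index, so `ptr + offset` is
// different per thread, and each iteration advances past the whole grid's
// 8-pack chunk. Whatever is left falls through to the single-pack "slow loop"
// quoted below.
template <typename T>
__device__ void checkBody(const BytePack* ptr, size_t sizeInBP) {
  size_t nWorkers = (size_t)blockDim.x * gridDim.x;
  size_t offset = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  for (; offset + (UNROLL - 1) * nWorkers < sizeInBP; offset += nWorkers * UNROLL) {
    checkChunk<T>(ptr + offset);
  }
}
```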
```cuda
// We just do regular load and check
for (; offset < sizeInBP; offset += blockDim.x * gridDim.x) {
  BytePack tmp = ptr[offset];
  CheckBytePack<T, sizeof(BytePack)/sizeof(T)>::check(&tmp);
```
Confused: if we are sure we have enough data left for one call to `CheckBytePack<T, B/T>`, doesn't that also mean we have enough data for a faster call to `CheckBytePack`?
Do you mean why not a call to `checkChunk`? The reason is that `checkChunk` checks 8 `BytePack`s in one call, while `CheckBytePack` checks 1 `BytePack`. This slow loop here accounts for the case where we don't have 8 `BytePack`s left.
Oh, yes, I got confused between the two. I think this makes sense.
So in summary, I think the algorithm is:
- Pre: always < 1 `BytePack`, since = 1 would imply it's already 16B-aligned; so don't even use `CheckBytePack`, just do a local element-wise check
- Body: process chunks of 8 `BytePack`s (e.g. 8 * 16 = 128B chunks) per call
- Tail: since alignment is now guaranteed, just process the remaining (N < 8) 16B `BytePack`s individually
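For completeness, a compact end-to-end sketch of those three phases under the same assumptions; the names, the device-side assert, and the exact indexing are illustrative, not the PR's source.

```cuda
#include <cassert>
#include <cstddef>
#include <cstdint>

struct BytePack { unsigned long long u64[2]; };  // 16 bytes

template <typename T>
__device__ void checkElt(T v) {
  assert(!isnan(static_cast<float>(v)));  // illustrative error path
}

template <typename T>
__device__ void checkPack(BytePack pack) {
  const T* e = reinterpret_cast<const T*>(&pack);
#pragma unroll
  for (int k = 0; k < (int)(sizeof(BytePack) / sizeof(T)); k++) checkElt(e[k]);
}

template <typename T>
__global__ void checkForNaN(const T* data, size_t numElts) {
  constexpr int UNROLL = 8;
  constexpr int EltPerPack = sizeof(BytePack) / sizeof(T);
  size_t tid = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  size_t nWorkers = (size_t)blockDim.x * gridDim.x;

  // Pre: fewer than EltPerPack leading elements until the pointer hits a
  // 16-byte boundary; check them one by one (assumes sizeof(T) divides 16).
  uintptr_t addr = reinterpret_cast<uintptr_t>(data);
  size_t head = addr % 16 ? (16 - addr % 16) / sizeof(T) : 0;
  if (head > numElts) head = numElts;
  for (size_t i = tid; i < head; i += nWorkers) checkElt(data[i]);

  const BytePack* ptr = reinterpret_cast<const BytePack*>(data + head);
  size_t sizeInBP = (numElts - head) / EltPerPack;

  // Body: chunks of UNROLL packs per thread (8 * 16B = 128B), interleaved
  // across the grid.
  size_t offset = tid;
  for (; offset + (UNROLL - 1) * nWorkers < sizeInBP; offset += nWorkers * UNROLL) {
#pragma unroll
    for (int j = 0; j < UNROLL; j++) checkPack<T>(ptr[offset + j * nWorkers]);
  }

  // Tail: alignment is guaranteed, but fewer than UNROLL packs remain in this
  // thread's stride; check them one pack at a time.
  for (; offset < sizeInBP; offset += nWorkers) checkPack<T>(ptr[offset]);

  // Trailing elements that don't fill a whole pack are checked individually.
  for (size_t i = head + sizeInBP * EltPerPack + tid; i < numElts; i += nWorkers) {
    checkElt(data[i]);
  }
}
```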
I might be understanding wrong, but if "our perf is on-par with torch ops", why not just use `torch.any(torch.isnan(x))`?
Good question. I had the same question too. So the reasoning goes like this (backward):
Re why "we need to stop communication from spreading NaNs", here is a view from @wconstab:
Another flavor on this is, we could use it if we could easily modify it to trap() on NaN, instead of asynchronously producing a bool tensor that someone (who?) has to check (when?). We definitely don't want to do a CUDA synchronize after each NaN check and check it on the CPU side.
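A minimal sketch of the trap-style alternative described here, i.e. aborting from inside the kernel instead of asynchronously producing a bool tensor; this is illustrative only, not the checker's actual behavior.

```cuda
#include <cstddef>
#include <cstdio>

// Grid-stride scan that aborts the kernel as soon as a NaN is seen; the
// stream then reports an error, so no host-side synchronize-and-inspect of a
// result tensor is needed.
__global__ void trapOnNaN(const float* data, size_t numElts) {
  size_t stride = (size_t)blockDim.x * gridDim.x;
  for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < numElts; i += stride) {
    if (isnan(data[i])) {
      printf("NaN found at index %llu\n", (unsigned long long)i);
      __trap();  // device-side abort; subsequent CUDA calls on this context report an error
    }
  }
}
```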
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Some customers would like to run the NaN checks on the fly, so we are improving its efficiency.

## Benchmarking
Allreduce of 2G floats, with `TORCH_NCCL_NAN_CHECK=1`.
Red kernel: ncclAllreduce. Blue kernel: NaN check.
<img width="1093" alt="Screenshot 2024-09-06 at 10 00 05 PM" src="https://github.com/user-attachments/assets/5501bc31-024f-4115-adb2-dd66eb4025d3">

## Comparison with torch ops
Let's say a user manually checks for NaNs with the following torch op before the all-reduce:
```
torch.any(torch.isnan(x))
```
<img width="1091" alt="Screenshot 2024-09-06 at 10 14 53 PM" src="https://github.com/user-attachments/assets/1f8b5f63-c955-4612-bb96-241b6c69959b">
So our perf is on par with torch ops.

## Changes
- Load from vidmem using "big packs" of 16 bytes
- Bump `blockDim.x` from 256 to 512
- Separate loads and checks into two loops, each of 8 iterations
- Unroll the loops
- Templated functions for checking NaN in a "big pack" based on dtype

Special thanks to @jbachan from NCCL!

Pull Request resolved: pytorch#135414
Approved by: https://github.com/wconstab
cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
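Tying the Changes list together, a hypothetical host-side launch under the 512-thread block size mentioned above; `checkForNaN` and `launchNaNCheck` are illustrative names, not the PR's API.

```cuda
#include <algorithm>
#include <cstddef>
#include <cuda_runtime.h>

// Kernel as sketched earlier in this thread (hypothetical name/signature).
template <typename T>
__global__ void checkForNaN(const T* data, size_t numElts);

// Hypothetical launch: 512 threads per block (bumped from 256 per the PR),
// grid size capped because the kernel's grid-stride loops cover any remainder.
inline void launchNaNCheck(const float* devPtr, size_t numElts, cudaStream_t stream) {
  const size_t blockSize = 512;
  const size_t maxBlocks = 1024;
  int numBlocks = (int)std::max<size_t>(
      1, std::min<size_t>((numElts + blockSize - 1) / blockSize, maxBlocks));
  checkForNaN<float><<<numBlocks, (int)blockSize, 0, stream>>>(devPtr, numElts);
}
```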