[Inductor] fix device error for NopKernelSchedulerNode #141372

BoyuanFeng · 2024-11-22T19:53:26Z

This PR adds device guard support for NopKernelSchedulerNode, which may create a tensor. Before this PR, Inductor did not codegen a device guard for NopKernelSchedulerNode, so a tensor created by such a node could be allocated on the wrong device, leading to errors.

Prior to the PR:

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1 = args
    args.clear()
    assert_size_stride(arg0_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg1_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg2_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg3_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg4_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg5_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg6_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg7_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg8_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg9_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg10_1, (1, 1, 16, 16), (256, 256, 16, 1))
    buf0 = empty_strided_cuda((1, 1, 2048), (2048, 2048, 1), torch.float32) # TODO: ERROR here. Should be cuda:1
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1)
        buf1 = empty_strided_cuda((1, 1, 2048, 128), (262144, 262144, 128, 1), torch.bfloat16)
        # Topologically Sorted Source Nodes: [flex_attention], Original ATen: []
        stream1 = get_raw_stream(1)
        breakpoint()
        triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, arg3_1, arg4_1, arg5_1, arg6_1, buf1, grid=torch._inductor.kernel.flex_attention.flex_attention_grid(1, 1, 2048, 128, meta0), stream=stream1)
        del arg0_1
        del arg1_1
        del arg2_1
        del arg3_1
        del arg4_1
        del arg5_1
        del arg6_1
        del buf0
    return (buf1, )

After the PR:

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1 = args
    args.clear()
    assert_size_stride(arg0_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg1_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg2_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg3_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg4_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg5_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg6_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg7_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg8_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg9_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg10_1, (1, 1, 16, 16), (256, 256, 16, 1))
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1)
        buf0 = empty_strided_cuda((1, 1, 2048), (2048, 2048, 1), torch.float32) # New: move into device guard
        buf1 = empty_strided_cuda((1, 1, 2048, 128), (262144, 262144, 128, 1), torch.bfloat16)
        # Topologically Sorted Source Nodes: [flex_attention], Original ATen: []
        stream1 = get_raw_stream(1)
        triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, arg3_1, arg4_1, arg5_1, arg6_1, buf1, grid=torch._inductor.kernel.flex_attention.flex_attention_grid(1, 1, 2048, 128, meta0), stream=stream1)
        del arg0_1
        del arg1_1
        del arg2_1
        del arg3_1
        del arg4_1
        del arg5_1
        del arg6_1
        del buf0
    return (buf1, )

Fixes #141010

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

pytorch-bot · 2024-11-22T19:53:30Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141372

✅ You can merge normally! (2 Unrelated Failures)

As of commit f412f85 with merge base 51b7528:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / linux-jammy-py3.10-clang15-asan / test (default, 2, 6, lf.linux.4xlarge) (gh) (matched linux rule in flaky-rules.json)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / linux-focal-py3_9-clang9-xla / test (xla, 1, 1, lf.linux.12xlarge) (gh) (trunk failure)
##[error]Process completed with exit code 128.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

BoyuanFeng · 2024-11-27T00:19:33Z

torch/_inductor/graph.py

+                isinstance(buffer, ir.ComputedBuffer)
+                and buffer.is_zero_elements()
+                and device == torch.device("cpu")
+            )


The prior implementation also skipped empty CUDA tensors, which led to a wrong V.graph.device_ops.
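
As a rough illustration of the condition above, here is a toy model in which a set of recorded device types stands in for whatever state feeds V.graph.device_ops. `ToyBuffer` and `collect_device_types` are hypothetical names invented for this sketch, and the real check also involves `ir.ComputedBuffer`, which is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyBuffer:
    """Hypothetical stand-in for an Inductor buffer; not a real IR class."""
    name: str
    numel: int
    device_type: str  # "cpu" or "cuda"

    def is_zero_elements(self):
        return self.numel == 0


def collect_device_types(buffers, skip_empty_on_any_device):
    """Return the set of device types the toy graph is considered to use.

    skip_empty_on_any_device=True mimics the prior behavior (every empty
    buffer is ignored); False mimics the new condition (only empty CPU
    buffers are ignored).
    """
    device_types = set()
    for buf in buffers:
        if buf.is_zero_elements():
            if skip_empty_on_any_device or buf.device_type == "cpu":
                continue
        device_types.add(buf.device_type)
    return device_types


if __name__ == "__main__":
    buffers = [ToyBuffer("buf0", 0, "cuda"), ToyBuffer("buf1", 0, "cpu")]
    print(collect_device_types(buffers, skip_empty_on_any_device=True))
    print(collect_device_types(buffers, skip_empty_on_any_device=False))
```

With the prior behavior, a graph whose only CUDA buffer is empty records no CUDA device at all in this model; with the new condition, the CUDA device is still recorded.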

facebook-github-bot · 2024-11-27T00:20:52Z

@BoyuanFeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

drisspg

I am going to let @eellison give the stamp on this one.

BoyuanFeng · 2024-12-04T21:40:02Z

torch/_inductor/scheduler.py

-            if not isinstance(node, NopKernelSchedulerNode) and (
-                device := node.get_device()
-            ):
+            if device := node.get_device():


empty_strided_cuda generates a NopKernelSchedulerNode. Before this PR, no device guard was created for a NopKernelSchedulerNode, which leads to an error when empty_strided_cuda should create a tensor on cuda:1.
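
The effect of dropping the `isinstance` check can be sketched with a toy code generator. This is not the Inductor scheduler: `ToyNode`, `codegen`, and `EXAMPLE_NODES` are hypothetical, and the real wrapper also emits `torch.cuda.set_device`, which is omitted here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyNode:
    """Hypothetical scheduler node: a line of code plus its target device."""
    line: str
    device_index: Optional[int]  # None means "no device needed"
    is_nop: bool = False


def codegen(nodes, guard_nop_nodes):
    """Emit wrapper lines, opening a device guard when the device changes.

    guard_nop_nodes=False mimics the old condition, where nop nodes never
    trigger a guard; True mimics the new condition.
    """
    lines = []
    current = None  # device index of the guard we are currently inside
    for node in nodes:
        wants_guard = node.device_index is not None and (
            guard_nop_nodes or not node.is_nop
        )
        if wants_guard and node.device_index != current:
            current = node.device_index
            lines.append(f"with torch.cuda._DeviceGuard({current}):")
        indent = "    " if current is not None else ""
        lines.append(indent + node.line)
    return lines


EXAMPLE_NODES = [
    ToyNode("buf0 = empty_strided_cuda(...)", 1, is_nop=True),
    ToyNode("buf1 = empty_strided_cuda(...)", 1),
    ToyNode("triton_tem_fused_0.run(...)", 1),
]

if __name__ == "__main__":
    print("\n".join(codegen(EXAMPLE_NODES, guard_nop_nodes=False)))
    print("\n".join(codegen(EXAMPLE_NODES, guard_nop_nodes=True)))
```

With `guard_nop_nodes=False` the `buf0` line is emitted before the guard opens, matching the shape of the "Prior to the PR" listing; with `True` it moves inside the guard.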

eellison

Looks good, but a question on decode_device: where is the unnormalized device coming from?

eellison · 2024-12-05T18:42:58Z

torch/_inductor/lowering.py

@@ -3027,7 +3027,7 @@ def _new_constant(
        dtype = decode_dtype(dtype) or x.get_dtype()
        device = device or x.get_device()
        size = [sympy.Integer(s) for s in size]
-        return _full(fill_value, device, dtype, size)
+        return _full(fill_value, decode_device(device), dtype, size)


Why do we have all the decode_device calls? I would sort of expect the device to be normalized already, or maybe we can normalize it in a more central place. It's going to be really easy to forget to do this.

BoyuanFeng · Dec 5, 2024

The unnormalized device comes from Dynamo. For example, for this user code:

def f(x):
    return torch.ops.aten.empty_strided([2, 3], [3, 1], device="cuda")

we get the Dynamo graph:

def forward(self):
    empty_strided: "f32[2, 3][3, 1]cuda:0" = torch.ops.aten.empty_strided([2, 3], [3, 1], device = 'cuda')
    return (empty_strided,)

Note that Dynamo just copies device="cuda" from the user code. This unnormalized device is passed directly to AOTAutograd and finally to Inductor.

I talked with @yanboliang about whether we should modify Dynamo or Inductor. Historically, the device is normalized in lowering.py with decode_device. There are now 9 call sites of decode_device in lowering.py on the main branch. I think a better approach might be to remove all of these decode_device calls and normalize during GraphLowering. I can open a separate PR for that.
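
To make "unnormalized" concrete, here is a torch-free sketch of what normalizing a device could look like. `ToyDevice`, `parse_device`, and `normalize_device` are hypothetical names for illustration only; this is not the body of decode_device, and the `current_index` parameter stands in for however the current device index is obtained.

```python
from typing import NamedTuple, Optional


class ToyDevice(NamedTuple):
    """Hypothetical stand-in for a device: a type plus an optional index."""
    type: str
    index: Optional[int] = None


def parse_device(spec):
    """Parse 'cuda' or 'cuda:1' into a ToyDevice."""
    if isinstance(spec, ToyDevice):
        return spec
    kind, _, idx = spec.partition(":")
    return ToyDevice(kind, int(idx) if idx else None)


def normalize_device(spec, current_index=0):
    """Fill in a missing index for device types that carry one."""
    device = parse_device(spec)
    if device.type not in ("cpu", "meta") and device.index is None:
        return ToyDevice(device.type, current_index)
    return device


if __name__ == "__main__":
    print(parse_device("cuda"))        # index is still missing
    print(normalize_device("cuda"))    # index filled in
    print(normalize_device("cuda:1"))  # already normalized, unchanged
```

In this model, the unnormalized `ToyDevice("cuda")` and the normalized `ToyDevice("cuda", 0)` compare unequal even though they may refer to the same device, which is the kind of mismatch that makes a forgotten normalization call easy to miss.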

Yeah, I think we should just do this in the GraphLowering class prior to invoking the op. Anyway, we don't need to block this PR.

eellison · 2024-12-06T18:47:58Z

torch/_inductor/lowering.py

@@ -3027,7 +3027,7 @@ def _new_constant(
        dtype = decode_dtype(dtype) or x.get_dtype()
        device = device or x.get_device()
        size = [sympy.Integer(s) for s in size]
-        return _full(fill_value, device, dtype, size)
+        return _full(fill_value, decode_device(device), dtype, size)


Yeah, I think we should just do this in the GraphLowering class prior to invoking the op. Anyway, we don't need to block this PR.
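
One way to picture "normalize in a central place prior to invoking the op" is a wrapper applied once around every lowering, so individual lowerings no longer need their own normalization call. The sketch below is only an illustration under that assumption; `with_normalized_device`, `normalize_device`, and `toy_empty_strided` are hypothetical and do not reflect how GraphLowering is implemented.

```python
import functools


def normalize_device(device, current_index=0):
    """Toy normalizer: turn 'cuda' into 'cuda:<current_index>'."""
    if device is None or ":" in device or device in ("cpu", "meta"):
        return device
    return f"{device}:{current_index}"


def with_normalized_device(lowering_fn):
    """Normalize the `device` keyword once, before the lowering runs."""
    @functools.wraps(lowering_fn)
    def wrapper(*args, **kwargs):
        if "device" in kwargs:
            kwargs["device"] = normalize_device(kwargs["device"])
        return lowering_fn(*args, **kwargs)
    return wrapper


@with_normalized_device
def toy_empty_strided(size, stride, *, device=None):
    # A hypothetical lowering; it can assume `device` is already normalized.
    return {"size": size, "stride": stride, "device": device}


if __name__ == "__main__":
    print(toy_empty_strided([2, 3], [3, 1], device="cuda"))
    print(toy_empty_strided([2, 3], [3, 1], device="cuda:1"))
```

The design point is that the normalization lives in one wrapper instead of at each call site, so a new lowering cannot forget it.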

BoyuanFeng · 2024-12-06T18:50:40Z

@pytorchbot merge

pytorchmergebot · 2024-12-06T18:52:18Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2024-12-06T18:52:36Z

Merge failed

Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check

facebook-github-bot · 2024-12-06T19:22:37Z

@BoyuanFeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

BoyuanFeng · 2024-12-06T19:26:11Z

@pytorchbot merge -f "skip unrelated errors"

pytorchmergebot · 2024-12-06T19:27:38Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Pull Request resolved: #141372
Approved by: https://github.com/eellison

fix device error for NopKernelSchedulerNode

97ec52a

BoyuanFeng added the ciflow/trunk, topic: not user facing, module: inductor, and ciflow/inductor labels Nov 22, 2024

BoyuanFeng requested review from Chillee and drisspg November 22, 2024 19:53

skip if not multigpu

a8b1873

BoyuanFeng marked this pull request as draft November 23, 2024 06:27

BoyuanFeng added 4 commits November 26, 2024 13:32

Merge branch 'main' into bf/NopKernelSchedulerNode-device

5bfd5f1

fix a bug for add_device_info

e754b87

nit

809e4d8

add test in inductor

3cc2d00

BoyuanFeng commented Nov 27, 2024

BoyuanFeng added the ciflow/rocm label Nov 27, 2024

BoyuanFeng added 3 commits December 4, 2024 10:21

fix device

11c683b

fix device

f94f3e4

Merge branch 'main' into bf/NopKernelSchedulerNode-device

f412f85

drisspg reviewed Dec 4, 2024

BoyuanFeng requested a review from eellison December 4, 2024 21:37

BoyuanFeng commented Dec 4, 2024

BoyuanFeng marked this pull request as ready for review December 4, 2024 21:40

eellison reviewed Dec 5, 2024

eellison approved these changes Dec 6, 2024

pytorchmergebot added the merging label Dec 6, 2024

pytorchmergebot removed the merging label Dec 6, 2024

pytorchmergebot added the merging label Dec 6, 2024

pytorchmergebot added the Merged label Dec 6, 2024

pytorchmergebot closed this in 61a7c83 Dec 6, 2024

pytorchmergebot removed the merging label Dec 6, 2024

github-actions bot deleted the bf/NopKernelSchedulerNode-device branch January 6, 2025 02:08
