Add host-side Triton TMA support to Inductor #137950
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137950
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 0d93a4e with merge base 4a8e493.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Looks good! Would you mind adding a couple of SymInt uses? Also, it would be cool to have a deduping mechanism for TMA descriptors on the same tensor.
@@ -262,6 +262,9 @@ def generate_user_defined_triton_kernel(
            autotune_configs=configs,
        )

    def generate_tma_descriptor(self, desc):
constant_args = [
    *self.dims,
    *self.block_dims,
    self.element_size,
]
Is `dims` here actually constant?
Not necessarily, they can be SymInts. I was following what the `UserDefinedTritonKernel` does, where all non-`TensorBox` args are put into the `constant_args` here.

I was unsure about the semantics of the `constant_args` parameter of `ExternKernel`. Looking into the code, it seems `self.constant_args` is mostly used in the codegen-related methods, which are not relevant for `TMADescriptor` (nor for `UserDefinedTritonKernel`), as the codegen is overridden at the root. Although I also see it being used as a potential source of unbacked SymInts here. So perhaps I should keep this code as is?
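For illustration only, here is a tiny self-contained sketch of the convention described above (not the actual Inductor code): tensor-like arguments become real inputs of the node, while everything else, including SymInts, goes into `constant_args`. The helper name `split_kernel_args` is made up.

```python
# Illustrative sketch of the UserDefinedTritonKernel convention referenced
# above: non-tensor args (ints, SymInts, ...) are routed into constant_args,
# while tensor-like args become inputs of the node. Not the Inductor code.
from typing import Any, Callable, List, Tuple


def split_kernel_args(
    args: List[Any], is_tensor_like: Callable[[Any], bool]
) -> Tuple[List[Any], List[Any]]:
    inputs = [a for a in args if is_tensor_like(a)]
    constant_args = [a for a in args if not is_tensor_like(a)]
    return inputs, constant_args


# Toy usage: dims, block_dims, and element_size all land in constant_args,
# even when some of them are symbolic (SymInt).
inputs, constant_args = split_kernel_args(
    ["<TensorBox>", 1024, 1024, 64, 64, 4],
    is_tensor_like=lambda a: isinstance(a, str),  # toy predicate for the sketch
)
assert constant_args == [1024, 1024, 64, 64, 4]
```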
# link back to the underlying tensor in terms of ownership
# to avoid getting the underlying tensor deleted *before*
# the TMADescriptor node can be deleted.
NonOwningLayout(ReinterpretView(tensor, tensor.get_layout())),
I guess this works and I can't think of anything better - but if this comes up as a common pattern maybe we can loop back
You mean test cases? The unit tests run with
TMA descriptors are immutable, so this shouldn't be hard to do: would just need hashing on the underlying tensor and all the args. Let me try.
    block_dims: List[Union[int, torch.SymInt]],
    element_size: Optional[int] = None,
):
    key = (id(tensor), dims, block_dims, element_size)
@eellison I'm using `id(tensor)` here because `TensorBox` happens to be non-hashable. Although this looks correct, it's very restrictive: we require the same `TensorBox` Python object to hit the cache. Is there a more canonical way to do this in Inductor IR (that would allow, e.g., different `TensorBox`es referring to the same underlying storage)? Maybe I should unwrap the storage before doing `id(...)`, or would that ignore offsets in the views, which can lead to different `data_ptr()` values?
You could try to cover different Tensors with the same strides and the same underlying storage, but even then it would require us to fix the layout, and I'm not sure how common that is. I think this is good.
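Following up on the deduping exchange above, here is a minimal, self-contained sketch of the caching scheme being described. The `TMADescriptor` dataclass and the `get_or_create_tma_descriptor` helper are stand-ins for illustration, not the actual Inductor implementation.

```python
# Minimal sketch of the dedup cache discussed above. TMA descriptors are
# immutable, so the cache key is the identity of the underlying tensor plus
# all constructor args. The TMADescriptor class below is a stand-in for the
# Inductor IR node; names here are illustrative only.
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class TMADescriptor:  # stand-in for ir.TMADescriptor
    tensor_id: int
    dims: Tuple[int, ...]
    block_dims: Tuple[int, ...]
    element_size: Optional[int]


_descriptor_cache: Dict[tuple, TMADescriptor] = {}


def get_or_create_tma_descriptor(
    tensor: object,
    dims: Sequence[int],
    block_dims: Sequence[int],
    element_size: Optional[int] = None,
) -> TMADescriptor:
    # id(tensor) is used because TensorBox is not hashable; note that two
    # distinct TensorBox objects over the same storage would still miss the
    # cache, as pointed out in the discussion above.
    key = (id(tensor), tuple(dims), tuple(block_dims), element_size)
    if key not in _descriptor_cache:
        _descriptor_cache[key] = TMADescriptor(
            id(tensor), tuple(dims), tuple(block_dims), element_size
        )
    return _descriptor_cache[key]
```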
Landing this as the signals look good and all comments resolved. Happy to address further requests in a follow-up PR.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:

- Here we assume the availability of the host-side TMA API added to upstream Triton in triton-lang/triton#4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). The OSS Triton pin update should be done in December 2024.
- Due to the Dynamo support implemented in the previous PR, the `tma_descriptor_metadata` dict is delivered to the `triton_kernel_wrap_` lowering and passed to the `ir.UserDefinedTritonKernel` as an additional argument.
- Looking into the `tma_descriptor_metadata`, `ir.UserDefinedTritonKernel` substitutes the corresponding `TensorBox` arguments of the kernel (swapped upstream in Dynamo) by the new `ir.TMADescriptor` nodes implementing TMA descriptors in Inductor IR.
- `ir.TMADescriptor.__init__` provides the wiring between the upstream underlying `ir.TensorBox` and the downstream `ir.UserDefinedTritonKernel` kernel. In particular, we use `ir.NonOwningLayout` wrapping `ir.ReinterpretView` to avoid the upstream tensor's buffer being deleted prematurely (before the TMA descriptor is used in the Triton kernel).
- Via `ir.TMADescriptor.codegen`, Triton's `create_{1d,2d}_tma_descriptor` function call is codegened in the wrapper (in the host code).
- A new `TMADescriptorArg` dataclass is added to handle the Triton kernel metadata pertinent to host-side TMA.
- AOT Inductor support will be implemented in a follow-up PR.

Pull Request resolved: pytorch#137950
Approved by: https://github.com/eellison
ghstack dependencies: pytorch#137677
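To make the user-facing flow concrete, here is a rough sketch (not code from this PR) of a user-defined Triton kernel taking host-side TMA descriptors and compiled with `torch.compile`. The module path `triton.tools.experimental_descriptor` and the `tl._experimental_descriptor_load`/`_store` builtins are assumptions based on the Triton tutorial and pin referenced above; the kernel itself is a toy elementwise example.

```python
# Hedged sketch: host-side TMA descriptors created in a torch.compile'd
# function and consumed by a user-defined Triton kernel. With this PR, Dynamo
# records the descriptor calls into tma_descriptor_metadata and Inductor
# codegens the create_2d_tma_descriptor calls in the wrapper's host code.
# Requires a GPU with TMA support (e.g., H100) and a recent enough Triton.
import torch
import triton
import triton.language as tl
from triton.tools.experimental_descriptor import create_2d_tma_descriptor


@triton.jit
def add_one_kernel(in_desc, out_desc, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    off_m = pid_m * BLOCK_M
    off_n = pid_n * BLOCK_N
    # Load a (BLOCK_M, BLOCK_N) tile through the TMA descriptor.
    x = tl._experimental_descriptor_load(in_desc, [off_m, off_n], [BLOCK_M, BLOCK_N], tl.float32)
    tl._experimental_descriptor_store(out_desc, x + 1.0, [off_m, off_n])


def add_one(x: torch.Tensor) -> torch.Tensor:
    M, N = x.shape
    BLOCK_M, BLOCK_N = 64, 64
    out = torch.empty_like(x)
    # Host-side TMA descriptors: the API this PR teaches Inductor to handle.
    in_desc = create_2d_tma_descriptor(x.data_ptr(), M, N, BLOCK_M, BLOCK_N, x.element_size())
    out_desc = create_2d_tma_descriptor(out.data_ptr(), M, N, BLOCK_M, BLOCK_N, out.element_size())
    grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
    add_one_kernel[grid](in_desc, out_desc, BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N)
    return out


compiled = torch.compile(add_one, fullgraph=True)
y = compiled(torch.randn(256, 256, device="cuda"))
```

With the changes above, the generated wrapper would emit the corresponding `create_2d_tma_descriptor` calls on the host (via `ir.TMADescriptor.codegen`) rather than treating the descriptors as opaque constants.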
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec