Add host-side TMA support to AOTInductor #138878
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138878
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit e417138 with merge base 72ea7ba.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```diff
@@ -251,7 +255,29 @@ def generate_user_defined_triton_kernel(
     )

+    def generate_tma_descriptor(self, desc):
```
Because AOTI still uses a two-pass run at the moment, does the cpp wrapper codegen here need to read any information from the Python run? I have added things like `DeferredGpuGridLine` for that purpose. Do you think it is necessary to do something like that here?
I believe the Python and C++ codegens for the TMA descriptor creation are independent.
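To make that independence concrete, here is a minimal sketch of the two code paths. The class structure and the `desc.name` / `desc.rank` / `desc.args` attributes are illustrative stand-ins, not the actual Inductor internals:

```python
# Illustrative sketch only: the real Inductor wrapper-codegen classes differ,
# and `desc.name`, `desc.rank`, `desc.args` are hypothetical attributes
# standing in for whatever the TMA descriptor IR node carries.
class PythonWrapperCodegenSketch:
    def __init__(self):
        self.lines = []

    def writeline(self, line):
        self.lines.append(line)

    def generate_tma_descriptor(self, desc):
        # The Python wrapper emits a call to Triton's host-side API, e.g.
        #   desc1 = triton.tools.experimental_descriptor.create_2d_tma_descriptor(...)
        self.writeline(
            f"{desc.name} = triton.tools.experimental_descriptor."
            f"create_{desc.rank}d_tma_descriptor({', '.join(desc.args)})"
        )


class CppWrapperCodegenSketch(PythonWrapperCodegenSketch):
    def generate_tma_descriptor(self, desc):
        # The C++ wrapper instead emits a call to the generated
        # init{1,2}DTMADescriptor helper, e.g.
        #   init2DTMADescriptor(&desc1, buf0_ptr, ...);
        self.writeline(
            f"init{desc.rank}DTMADescriptor({', '.join(desc.args)});"
        )
```

Because each backend renders the descriptor creation directly from the same IR node, the C++ pass never has to consult values recorded during the Python pass, so no `DeferredGpuGridLine`-style deferred line appears to be needed here.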
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
This adds host-side Triton TMA support to AOTInductor. Notes:

- Two helper functions, `init1DTMADescriptor` and `init2DTMADescriptor`, are added to the C++ wrapper codegen on GPU, conditioned on the model having user-defined Triton kernels with host-side TMA (CUDA-specific).
- The C++ wrapper codegen on GPU emits TMA descriptor initialization via the aforementioned helper functions.
- Special handling is added for TMA descriptors in the Python wrapper codegen during compile-time autotuning, as the underlying tensor can't be passed directly to the user-defined Triton kernel. TMA descriptors are generated in between the source tensor's buffer and the kernel call, as in the full Python wrapper codegen (see the sketch below).
- This PR concludes the host-side Triton TMA support in PT2.

Pull Request resolved: pytorch#138878
Approved by: https://github.com/desertfire, https://github.com/chenyang78
ghstack dependencies: pytorch#138759, pytorch#138877
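As a rough illustration of the autotuning-time handling described above, the generated benchmarking code materializes a descriptor between the buffer and the kernel call. Buffer names, shapes, and the kernel call are made up for illustration; `create_2d_tma_descriptor` is Triton's experimental host-side TMA API:

```python
# Shape of the code emitted for compile-time autotuning (illustrative names).
import torch
import triton.tools.experimental_descriptor as tma

buf0 = torch.empty((256, 256), device="cuda", dtype=torch.float32)

# The raw tensor cannot be passed to a TMA-consuming user-defined Triton
# kernel, so a descriptor is created in between the source tensor's buffer
# and the kernel call:
desc0 = tma.create_2d_tma_descriptor(
    buf0.data_ptr(),      # global memory base address
    256, 256,             # global dims
    64, 64,               # block dims
    buf0.element_size(),  # element size in bytes
)
# my_user_defined_kernel.run(desc0, ..., grid=grid, stream=stream0)
```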
Stack from ghstack (oldest at bottom):
- #138878 (this PR)
- #138877
- #138759

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang
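Putting it together, here is a hedged end-to-end sketch of the flow this stack enables: a user-defined Triton kernel that consumes a host-side TMA descriptor, compiled ahead of time with AOTInductor. The module, kernel body, and the `torch._export.aot_compile` entry point are assumptions for illustration; host-side TMA also requires a Hopper-class GPU and a recent Triton:

```python
import torch
import triton
import triton.language as tl
import triton.tools.experimental_descriptor as tma


@triton.jit
def add_one_kernel(desc, out_ptr, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    # Load a BLOCK-sized tile through the host-created TMA descriptor.
    x = tl._experimental_descriptor_load(desc, [pid * BLOCK], [BLOCK], tl.float32)
    tl.store(out_ptr + pid * BLOCK + tl.arange(0, BLOCK), x + 1.0)


class AddOne(torch.nn.Module):
    def forward(self, x):
        out = torch.empty_like(x)
        BLOCK = 128
        # Host-side TMA descriptor over the flat input tensor.
        desc = tma.create_1d_tma_descriptor(
            x.data_ptr(), x.numel(), BLOCK, x.element_size()
        )
        add_one_kernel[(x.numel() // BLOCK,)](desc, out, BLOCK=BLOCK)
        return out


x = torch.randn(1024, device="cuda")
# AOTInductor compiles this to a shared library whose C++ wrapper creates
# the descriptor via the generated init1DTMADescriptor helper.
so_path = torch._export.aot_compile(AddOne(), (x,))
```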