[PyTorch] Hook up fp16_gemv_trans to x86 fp16 GEMM #137918
Conversation
This is the first big milestone we've been building towards! (TODO: also hook it up to GEMV in the same way fp16_gemv_trans is hooked up) Differential Revision: [D64280688](https://our.internmc.facebook.com/intern/diff/D64280688/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D64280688/)! [ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137918
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 06b6c11 with merge base 86602a6.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D64280688
Do you have performance numbers?
It improves decoding performance about 5x.
aten/src/ATen/native/CPUBlas.cpp
Outdated
// is to upconvert to fp32 and call sgemm. We can do better by
// fusing the conversion.
const bool fp16_gemv_trans_fast_path_would_be_beneficial =
    cpuinfo_initialize() && cpuinfo_has_x86_f16c() && !cpuinfo_has_x86_avx512fp16();
I guess checking cpuinfo_has_x86_avx512fp16 is not necessary, since oneDNN (mkldnn) won't use avx512fp16 to compute GEMMs by default, because the avx512fp16 FMA would incur accuracy loss.
@jgong5 https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/FP16-GEMM-using-AVX512-on-Sapphire-Rapids/m-p/1570739 doesn't seem to agree. I'm also confused -- I thought FMA was supposed to improve accuracy because it has high internal precision, so the result of the multiply doesn't have to be rounded to FP16 before the addition.
@swolchok The link you referred to is about MKL, not oneDNN (or mkldnn). MKL has a dedicated API (hgemm) that uses the AVX512_FP16 instructions, but users should be aware of the accuracy loss due to FP16 accumulators. It is not about the accumulator for a single FMA (which has high internal precision, as you mentioned) but about accumulation along the K-dim across multiple FMAs. On the other hand, oneDNN uses FP32 accumulators to keep high accuracy.
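To make the K-dim point concrete, here is a tiny standalone illustration (not code from this PR; it assumes a compiler with the _Float16 extension, e.g. recent GCC/Clang on x86): an fp16 running sum stops absorbing unit-sized updates once it reaches 2048, while an fp32 accumulator does not.

```cpp
#include <cstdio>

int main() {
  _Float16 acc_h = 0;  // fp16 accumulator (10-bit mantissa)
  float acc_f = 0;     // fp32 accumulator
  for (int k = 0; k < 4096; ++k) {
    acc_h = acc_h + (_Float16)1;  // at 2048 the fp16 spacing is 2, so +1 is lost
    acc_f = acc_f + 1.0f;         // exact in fp32 at this magnitude
  }
  // Expected output: fp16 acc = 2048, fp32 acc = 4096
  std::printf("fp16 acc = %g, fp32 acc = %g\n", (double)acc_h, (double)acc_f);
  return 0;
}
```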
oh, we use FP32 accumulation as well.
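For readers following along, a minimal sketch of what a fused-conversion fp16 dot product with fp32 accumulation can look like using F16C + FMA intrinsics. This is purely illustrative, not the kernel added by this PR; it assumes n is a multiple of 8 and a build with -mf16c -mfma.

```cpp
#include <immintrin.h>
#include <cstdint>

// Dot product of two fp16 vectors (stored as raw 16-bit words), converting
// to fp32 on the fly and accumulating in fp32. Assumes n % 8 == 0.
float dot_fp16_fused(const uint16_t* x, const uint16_t* y, int64_t n) {
  __m256 acc = _mm256_setzero_ps();  // fp32 accumulator
  for (int64_t i = 0; i < n; i += 8) {
    __m256 xf = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i*)(x + i)));  // F16C: fp16 -> fp32
    __m256 yf = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i*)(y + i)));
    acc = _mm256_fmadd_ps(xf, yf, acc);  // multiply-add in fp32
  }
  // Horizontal sum of the 8 fp32 lanes.
  __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
  s = _mm_hadd_ps(s, s);
  s = _mm_hadd_ps(s, s);
  return _mm_cvtss_f32(s);
}
```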
mkldnn_fp16_gemm, despite the name, is also available on ARM. It looks like your change will skip MKLDNN unless it is on an x86 platform, wouldn't it? What is the motivation for that?
> mkldnn_fp16_gemm despite the name is also available on ARM,

news to me!

> it looks like your change will skip MKLDNN unless it is on x86 platform, wouldn't it?

fortunately no, because ARM machines won't pass cpuinfo_has_x86_f16c().
> also available on ARM

I don't think so? https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/mkldnn/Utils.h#L120
Something to look at for BF16 though.
> oh, we use FP32 accumulation as well.

Will you remove this cpuinfo_has_x86_avx512fp16() check then? I don't think it is relevant.
> remove this cpuinfo_has_x86_avx512fp16() check then? I don't think it is relevant.

I am surprised, but you're definitely the authority on this and I don't have a Sapphire Rapids machine to test on. I'll leave a note for posterity though.
…rchitectures (#138005) Following up on previous rev to use fp16_gemv_trans in gemv, not just gemm-used-for-gemv. Differential Revision: [D64351092](https://our.internmc.facebook.com/intern/diff/D64351092/) Pull Request resolved: #138005 Approved by: https://github.com/malfet ghstack dependencies: #139082, #139083, #137918
No real reason to have the zero-beta restriction, so let's lift it. Testing: intentionally broke new paths locally to verify test coverage existed Differential Revision: [D64407752](https://our.internmc.facebook.com/intern/diff/D64407752/) Pull Request resolved: #138275 Approved by: https://github.com/malfet ghstack dependencies: #139082, #139083, #137918, #138005
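For context on what the lifted restriction means: BLAS-style gemv computes y := alpha * op(A) * x + beta * y, and beta == 0 is the special case where y is simply overwritten (so it may even start uninitialized). A plain reference version of the transposed case, illustrative only and not the PR's kernel, might look like:

```cpp
#include <cstdint>

// y[j] = alpha * sum_i A[i, j] * x[i] + beta * y[j], with A column-major (m x n).
// Per BLAS convention, when beta == 0 the old contents of y are ignored entirely.
void gemv_trans_ref(int64_t m, int64_t n, float alpha, const float* a, int64_t lda,
                    const float* x, float beta, float* y) {
  for (int64_t j = 0; j < n; ++j) {
    float acc = 0.0f;
    for (int64_t i = 0; i < m; ++i) {
      acc += a[j * lda + i] * x[i];  // column j of A dotted with x
    }
    y[j] = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * y[j];
  }
}
```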
Caused by #137918. Fixed by guarding all cpuinfo use with `!defined(__s390x__) && !defined(__powerpc__)`. Pull Request resolved: #139491 Approved by: https://github.com/huydhn, https://github.com/Skylion007
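A minimal sketch of that guard pattern, assuming the usual cpuinfo entry points (cpuinfo_initialize, cpuinfo_has_x86_f16c); the function name and exact condition are illustrative, not the actual fix:

```cpp
// cpuinfo is not available on s390x/powerpc builds, so every use is fenced off.
#if !defined(__s390x__) && !defined(__powerpc__)
#include <cpuinfo.h>
#endif

static bool fp16_fast_path_available() {
#if !defined(__s390x__) && !defined(__powerpc__)
  // F16C gives cheap fp16 <-> fp32 conversion on x86.
  return cpuinfo_initialize() && cpuinfo_has_x86_f16c();
#else
  return false;
#endif
}
```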
This is the first big milestone we've been building towards! (Following rev also hooks this up to actual gemv.) Testing: To check perf, I ran python torchchat.py generate stories110M --dtype fp16 --device cpu on an x86 machine without AVX512FP16. Observed roughly 5x tokens/sec increase. Differential Revision: [D64280688](https://our.internmc.facebook.com/intern/diff/D64280688/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D64280688/)! Pull Request resolved: pytorch#137918 Approved by: https://github.com/malfet ghstack dependencies: pytorch#139082, pytorch#139083
Stack from ghstack (oldest at bottom):
This is the first big milestone we've been building towards!
(Following rev also hooks this up to actual gemv.)
Testing: To check perf, I ran `python torchchat.py generate stories110M --dtype fp16 --device cpu` on an x86 machine without AVX512FP16. Observed roughly a 5x tokens/sec increase.
Differential Revision: D64280688
NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on Phabricator!