[Inference] Add multi-block mmha in FusedMT. #67211

Wanglongzhi2001 · 2024-08-08T06:34:11Z

PR Category

Inference

PR Types

New features

Description

pcard-71500
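Supports 128K-context inference for mmha in FusedMT. This PR adds a multi-block mmha kernel, which uses the flash decoding algorithm to split each head's computation into blocks.
[NOTE]
Setting the context length to 128K in the unit test could make the test run too long and fail, so the unit test uses a context length of only 2049 and only 2 layers.
[Usage]
Enable the flag with export FLAGS_fused_multi_transformer_op_use_mbfmha=true. No other changes are needed; the remaining steps are the same as for FusedMT.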
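The flash decoding split mentioned in the description can be sketched on the host as follows. This is a minimal illustration, not the kernel from this PR: it assumes the split runs along the key/value sequence for a single query of one head, and the names `Partial`, `AttendBlock`, `Combine`, and `MultiBlockAttention` are hypothetical. Each block keeps its own running max and exp-sum, and the partial results are rescaled to a common max when merged.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical per-block result: local max logit, local exp-sum, and the
// unnormalized output sum_j exp(s_j - max_logit) * v_j.
struct Partial {
  float max_logit;
  float exp_sum;
  std::vector<float> out;
};

// Attention of one query over the keys/values in [begin, end).
Partial AttendBlock(const std::vector<float>& q,
                    const std::vector<std::vector<float>>& k,
                    const std::vector<std::vector<float>>& v,
                    std::size_t begin, std::size_t end) {
  assert(begin < end && end <= k.size());
  const float inv_sqrt_dh = 1.0f / std::sqrt(static_cast<float>(q.size()));
  std::vector<float> logits(end - begin);
  float max_logit = -std::numeric_limits<float>::infinity();
  for (std::size_t t = begin; t < end; ++t) {
    float dot = 0.0f;
    for (std::size_t d = 0; d < q.size(); ++d) dot += q[d] * k[t][d];
    logits[t - begin] = dot * inv_sqrt_dh;
    max_logit = std::max(max_logit, logits[t - begin]);
  }
  Partial p{max_logit, 0.0f, std::vector<float>(q.size(), 0.0f)};
  for (std::size_t t = begin; t < end; ++t) {
    const float w = std::exp(logits[t - begin] - max_logit);
    p.exp_sum += w;
    for (std::size_t d = 0; d < q.size(); ++d) p.out[d] += w * v[t][d];
  }
  return p;
}

// Merge block results: rescale each block to the global max, then normalize.
std::vector<float> Combine(const std::vector<Partial>& parts) {
  assert(!parts.empty());
  float global_max = parts[0].max_logit;
  for (const Partial& p : parts) global_max = std::max(global_max, p.max_logit);
  float total = 0.0f;
  std::vector<float> out(parts[0].out.size(), 0.0f);
  for (const Partial& p : parts) {
    const float rescale = std::exp(p.max_logit - global_max);
    total += rescale * p.exp_sum;
    for (std::size_t d = 0; d < out.size(); ++d) out[d] += rescale * p.out[d];
  }
  for (float& x : out) x /= total;
  return out;
}

// Split the sequence into blocks of block_size and merge the partial results.
std::vector<float> MultiBlockAttention(const std::vector<float>& q,
                                       const std::vector<std::vector<float>>& k,
                                       const std::vector<std::vector<float>>& v,
                                       std::size_t block_size) {
  assert(block_size > 0);
  std::vector<Partial> parts;
  for (std::size_t begin = 0; begin < k.size(); begin += block_size) {
    parts.push_back(
        AttendBlock(q, k, v, begin, std::min(begin + block_size, k.size())));
  }
  return Combine(parts);
}
```

Calling `MultiBlockAttention` with `block_size` equal to the sequence length gives the ordinary single-pass softmax attention, so it can serve as a reference for smaller block sizes. On a GPU the blocks would run in parallel, which is what makes long contexts such as 128K practical for a single decoding step.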

paddle-bot · 2024-08-08T06:34:15Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

paddle-ci-bot · 2024-08-16T03:03:06Z

Sorry to inform you that 11ac557's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

wwbitejotunn · 2024-08-20T03:53:14Z

paddle/phi/kernels/fusion/gpu/fused_multi_transformer_op.cu.h

  // 1.f / sqrt(Dh)
  float inv_sqrt_dh;
+  float inv_compression_ratio = 1.0f;


If rope embedding is added directly, do these parameters still need to be kept?

wwbitejotunn · 2024-08-20T03:58:33Z

paddle/phi/kernels/fusion/gpu/mmha_util.cu.h

+    *reinterpret_cast<Vec*>(&src_vec) = src;
+#pragma unroll
+    for (int i = 0; i < VecSize; i++) {
+      src_vec[i] = src_vec[i] * static_cast<T>(scale);


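Should this be computed in float32 first and then cast to T for storage?

Would this have any impact?

It is a precision issue: scale is a float, so casting it down to T before the multiplication may lose precision.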
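The precision concern raised in this thread can be reproduced on the host with a small sketch. It assumes T is bfloat16, emulated here by rounding a float32 to its upper 16 bits (round to nearest even, finite inputs only), and it assumes Dh = 128 for the scale. The helpers `RoundToBf16`, `ScaleInT`, and `ScaleInFloat` are hypothetical names for illustration and are not part of Paddle.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Emulate a float -> bfloat16 -> float round trip (round to nearest even).
// Assumption: the input is finite; NaN/Inf handling is omitted.
float RoundToBf16(float value) {
  assert(std::isfinite(value));
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  bits += 0x7FFFu + ((bits >> 16) & 1u);  // rounding bias with tie-to-even
  bits &= 0xFFFF0000u;                    // keep sign, exponent, 7 fraction bits
  std::memcpy(&value, &bits, sizeof(value));
  return value;
}

// Pattern from the reviewed diff: cast scale down to T, then multiply in T.
float ScaleInT(float x_bf16, float scale) {
  const float scale_t = RoundToBf16(scale);
  return RoundToBf16(x_bf16 * scale_t);
}

// Reviewer's suggestion: multiply in float32, then cast the result to T.
float ScaleInFloat(float x_bf16, float scale) {
  return RoundToBf16(x_bf16 * scale);
}
```

For `x = 163.0f / 128.0f` (exactly representable in bfloat16) and `scale = 1.0f / std::sqrt(128.0f)`, the two paths land on neighbouring bfloat16 values: `ScaleInFloat` gives 231/2048 and `ScaleInT` gives 230/2048, with the float32 path being the one closer to the exact product. The difference is a single unit in the last place, but it comes purely from rounding `scale` early.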

paddle-ci-bot · 2024-08-31T03:06:49Z

Sorry to inform you that b4c66ed's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

Aurelius84

LGTM for flags

Wanglongzhi2001 changed the title ~~[WIP] fmha support flash decoding~~ [Inference] fmha support flash decoding Aug 8, 2024

Wanglongzhi2001 changed the title ~~[Inference] fmha support flash decoding~~ [Inference] Add multi-block mmha in FusedMT. Aug 8, 2024

Wanglongzhi2001 force-pushed the mmha_128k branch from 11ac557 to d49809a Compare August 19, 2024 08:22

wwbitejotunn reviewed Aug 20, 2024

gongel mentioned this pull request Aug 21, 2024

[NOT Merge] FuseMT + 128K for compile #67330

Open

Wanglongzhi2001 force-pushed the mmha_128k branch from b4c66ed to 0df3244 Compare September 3, 2024 04:46

Wanglongzhi2001 added 12 commits September 3, 2024 12:56

fix typo

57cf7d4

fix rope

6238080

fix: fix accuracy and add test

b09187c

fix typo

e191adf

fix typo

01726b6

fix typo

f30d061

refactor test and remove todo in comments

f33fb91

fix test

496e65e

fix GQA and remove unnecessary params

2ced186

fix error

a761e4a

add gqa test

13c1d6f

resolve conflict and ignore the update of flash_attn

3bd31eb

Wanglongzhi2001 force-pushed the mmha_128k branch from 0df3244 to 3bd31eb Compare September 3, 2024 05:01

Wanglongzhi2001 added 2 commits September 3, 2024 15:15

fix the wrong usage of dev_ctx alloc

4f61ce5

fix the ignored code style

38f6646

yuanlehome approved these changes Sep 5, 2024

Aurelius84 approved these changes Sep 5, 2024

raindrops2sea approved these changes Sep 5, 2024

sneaxiy approved these changes Sep 5, 2024

carryyu merged commit 9b537dc into PaddlePaddle:develop Sep 5, 2024
29 checks passed
