[PHI] fix paddle.incubate.softmax_mask_fuse for big tensor #73096

DanielSun11 · 2025-06-04T09:38:01Z

PR Category

Execute Infrastructure

PR Types

Bug fixes

Description

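Fixes paddle.incubate.softmax_mask_fuse so that it supports big tensors.
Both the forward and the backward kernels of softmax_mask_fuse are modified.

The backward kernel only had out-of-bounds memory accesses; the fix has little impact on overall performance.
The forward kernel had both out-of-bounds memory accesses and an invalid configuration at kernel launch. To get better performance, the original implementation used a fairly rigid block layout that cannot handle large problem sizes. The fixed kernel is more general, but it is slower than the original kernel.
To keep performance while still handling large tensors, two CUDA kernels now coexist. When the tensor is small, the original implementation (renamed SoftmaxMaskFuseV1GPUKernel) is still used to preserve performance. When the tensor is large, the more general SoftmaxMaskFuseV2GPUKernel is used.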

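Problem 1: out-of-bounds memory access

Cause:
The forward and backward kernels used an int offset when accessing global memory. Offsets of 2^31 or more overflow and produce illegal addresses.
Fix:
Change the key index variables to int64.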
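The overflow condition can be illustrated on the host side. The following is a minimal sketch, not code from the PR: `RowOffset` and `FitsInInt32` are hypothetical helper names, and the row-major layout (row index times row length) is an assumption used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not from the PR): offset of the first element of a row,
// computed in 64-bit arithmetic so it cannot wrap for large tensors.
inline int64_t RowOffset(int64_t row_index, int64_t row_length) {
  assert(row_index >= 0 && row_length >= 0);
  return row_index * row_length;
}

// Hypothetical helper: true when an offset can still be held in a 32-bit
// signed int, i.e. when the pre-fix int index would not have overflowed.
inline bool FitsInInt32(int64_t offset) {
  return offset <= std::numeric_limits<int32_t>::max();
}
```

For a tensor with 2^27 rows of 32 elements, the last row starts at offset 2^32 - 32, which `FitsInInt32` rejects, so an int index cannot address it.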

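Problem 2: invalid configuration at kernel launch

Cause:
At kernel launch, the grid dimensions were configured as dim3 blocks(query_seq_len / batches_per_block, attn_heads, batches);
Because gridDim.y and gridDim.z are limited to 65535, CUDA error 9 is reported when attn_heads or batches exceeds that value.
Fix:
Compute the total number of blocks needed, use only the x dimension of gridDim (maximum 2^32 - 1), and recompute the memory index into the mask tensor inside the CUDA kernel.
Grid block configuration before and after the fix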

// Grid block configuration at kernel launch before the fix
// Neither attn_heads nor batches may exceed 65535
dim3 blocks(query_seq_len / batches_per_block, attn_heads, batches);
// Grid block configuration at kernel launch after the fix
// total_blocks = (query_seq_len / batches_per_block) * attn_heads * batches
int64_t total_blocks = batch_count / batches_per_block;
dim3 blocks(total_blocks);
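The two launch configurations can be compared with a small host-side sketch. It assumes, as the comment above implies, that batch_count equals batches * attn_heads * query_seq_len and that it divides evenly by batches_per_block. The function names `Fits3DGrid` and `TotalBlocks` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// CUDA limit for gridDim.y and gridDim.z.
constexpr int64_t kMaxGridDimYZ = 65535;

// The pre-fix 3D launch is only valid when the y and z extents fit the limit.
inline bool Fits3DGrid(int64_t attn_heads, int64_t batches) {
  return attn_heads <= kMaxGridDimYZ && batches <= kMaxGridDimYZ;
}

// Post-fix launch: every block is placed on the x dimension of the grid.
inline int64_t TotalBlocks(int64_t batches, int64_t attn_heads,
                           int64_t query_seq_len, int64_t batches_per_block) {
  assert(batches_per_block > 0);
  // Assumed definition: one "batch" is one softmax row.
  const int64_t batch_count = batches * attn_heads * query_seq_len;
  return batch_count / batches_per_block;
}
```

With attn_heads = 70000 the 3D grid is rejected, while the flattened grid simply needs 2 * 70000 * 64 / 4 = 2240000 blocks on the x dimension.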

Recomputing each block's work assignment inside the CUDA kernel

// Key code before the fix
  int data_first_idx =
      (blockDim.y *
           (blockIdx.x + gridDim.x * (blockIdx.y + gridDim.y * blockIdx.z)) +
       threadIdx.y) *
      kLocalBatchSize;
// (blockIdx.x + gridDim.x * (blockIdx.y + gridDim.y * blockIdx.z)) is the
// linear index of the current block among all blocks in the grid
  int mask_fist_idx =
      (blockDim.y * (blockIdx.x + gridDim.x * blockIdx.z) + threadIdx.y) *
      kLocalBatchSize;
// Equivalent code after the fix
  u_int32_t blockInGrid = blockIdx.x;
  u_int32_t indexInMaskDim0 = blockInGrid / (attn_heads * query_seqs);
  u_int32_t indexInMaskDim2 = blockInGrid % (query_seqs);
  int64_t data_first_idx =
      (blockDim.y * static_cast<int64_t>(blockInGrid) + threadIdx.y) *
      kLocalBatchSize;
  int64_t mask_fist_idx =
      (blockDim.y * (indexInMaskDim2 +
                     static_cast<int64_t>(query_seqs) * indexInMaskDim0) +
       threadIdx.y) *
      kLocalBatchSize;
  // Correspondence between the names after and before the fix
  // query_seqs <-> gridDim.x
  // attn_heads <-> gridDim.y
  // indexInMaskDim0 <-> blockIdx.z
  // indexInMaskDim2 <-> blockIdx.x
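The equivalence of the two index computations can be checked exhaustively on a small grid. The sketch below emulates both mappings on the host. `FirstIndices`, `FromGrid3D`, `FromFlatGrid` and `MappingsAgree` are hypothetical names, and all values are widened to int64_t for simplicity rather than mirroring the kernel's exact types.

```cpp
#include <cassert>
#include <cstdint>

struct FirstIndices {
  int64_t data_first_idx;
  int64_t mask_first_idx;
};

// Pre-fix mapping: the block position comes from a 3D grid.
inline FirstIndices FromGrid3D(int64_t bx, int64_t by, int64_t bz,
                               int64_t grid_x, int64_t grid_y,
                               int64_t block_dim_y, int64_t thread_y,
                               int64_t local_batch_size) {
  const int64_t block_in_grid = bx + grid_x * (by + grid_y * bz);
  return {(block_dim_y * block_in_grid + thread_y) * local_batch_size,
          (block_dim_y * (bx + grid_x * bz) + thread_y) * local_batch_size};
}

// Post-fix mapping: only the flat block index is known, so the mask
// coordinates are recovered with a division and a modulo.
inline FirstIndices FromFlatGrid(int64_t block_in_grid, int64_t query_seqs,
                                 int64_t attn_heads, int64_t block_dim_y,
                                 int64_t thread_y, int64_t local_batch_size) {
  assert(query_seqs > 0 && attn_heads > 0);
  const int64_t index_in_mask_dim0 = block_in_grid / (attn_heads * query_seqs);
  const int64_t index_in_mask_dim2 = block_in_grid % query_seqs;
  return {(block_dim_y * block_in_grid + thread_y) * local_batch_size,
          (block_dim_y * (index_in_mask_dim2 + query_seqs * index_in_mask_dim0) +
           thread_y) * local_batch_size};
}

// Compares both mappings for every block and every threadIdx.y of a grid.
inline bool MappingsAgree(int64_t grid_x, int64_t grid_y, int64_t grid_z,
                          int64_t block_dim_y, int64_t local_batch_size) {
  for (int64_t bz = 0; bz < grid_z; ++bz)
    for (int64_t by = 0; by < grid_y; ++by)
      for (int64_t bx = 0; bx < grid_x; ++bx)
        for (int64_t ty = 0; ty < block_dim_y; ++ty) {
          const FirstIndices a = FromGrid3D(bx, by, bz, grid_x, grid_y,
                                            block_dim_y, ty, local_batch_size);
          const FirstIndices b =
              FromFlatGrid(bx + grid_x * (by + grid_y * bz), grid_x, grid_y,
                           block_dim_y, ty, local_batch_size);
          if (a.data_first_idx != b.data_first_idx ||
              a.mask_first_idx != b.mask_first_idx)
            return false;
        }
  return true;
}
```

The mappings agree because bx + grid_x * by is always smaller than grid_x * grid_y, so the division recovers blockIdx.z and the modulo recovers blockIdx.x.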

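The largest tensor currently supported has numel = batches * attn_heads * query_seq_len * key_seq_len = (2^32 - 1) * 8192, where batches * attn_heads * query_seq_len is at most 2^32 - 1. With float16 as an example, such a tensor needs about 65536 GB of GPU memory, which covers the vast majority of use cases.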
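The memory figure follows from simple arithmetic, sketched below. The helper names `Numel` and `RequiredGiB` are hypothetical, and the sketch assumes that the "GB" above means GiB (2^30 bytes) and that float16 takes 2 bytes per element.

```cpp
#include <cassert>
#include <cstdint>

// Number of elements for a given number of softmax rows and row length.
inline uint64_t Numel(uint64_t rows, uint64_t key_seq_len) {
  return rows * key_seq_len;
}

// Memory footprint in GiB (2^30 bytes), rounded up.
inline uint64_t RequiredGiB(uint64_t numel, uint64_t bytes_per_element) {
  assert(bytes_per_element > 0);
  const uint64_t bytes = numel * bytes_per_element;
  const uint64_t gib = 1ULL << 30;
  return (bytes + gib - 1) / gib;
}
```

With rows = 2^32 - 1 and key_seq_len = 8192, the tensor holds 2^45 - 2^13 elements, which is just under 2^46 bytes in float16 and rounds up to 65536 GiB.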
Performance test

Tensor config	Before fix	After fix	Performance change
[1, 1, 33554432, 32],"float16"	10843	13640	-25.80%
[1, 1, 65536, 32],"float16"	26.498	31.865	-20.25%

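In-depth profiling with Nsight Systems and Nsight Compute showed that the number of registers required per thread in the modified CUDA kernel rose from 30 to 37, which lowers warp occupancy and in turn hurts performance. The fixed CUDA kernel is nevertheless more general and can handle larger tensors. To keep performance while supporting big tensors, the original CUDA kernel (SoftmaxMaskFuseV1GPUKernel) is still used when the tensor is small, and the fixed SoftmaxMaskFuseV2GPUKernel is used when the tensor is large.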
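The description does not state the exact condition used to pick between the two kernels. The sketch below shows one plausible selection rule under stated assumptions: the original kernel is chosen only when its 3D grid is legal and every element offset fits in a 32-bit int. `KernelVersion` and `ChooseKernel` are hypothetical names, not the PR's dispatch code.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

enum class KernelVersion { kV1, kV2 };

// ASSUMPTION: these criteria are illustrative; the PR's real threshold is not
// given in the description.
inline KernelVersion ChooseKernel(int64_t batches, int64_t attn_heads,
                                  int64_t query_seq_len, int64_t key_seq_len) {
  assert(batches > 0 && attn_heads > 0 && query_seq_len > 0 && key_seq_len > 0);
  constexpr int64_t kMaxGridDimYZ = 65535;  // CUDA limit for gridDim.y/z
  const int64_t numel = batches * attn_heads * query_seq_len * key_seq_len;
  const bool grid_is_legal =
      attn_heads <= kMaxGridDimYZ && batches <= kMaxGridDimYZ;
  const bool offsets_fit_int32 = numel <= std::numeric_limits<int32_t>::max();
  // V1 keeps the faster original layout; V2 is the general fallback.
  return (grid_is_legal && offsets_fit_int32) ? KernelVersion::kV1
                                              : KernelVersion::kV2;
}
```

Under this rule the benchmarked shape [1, 1, 65536, 32] would stay on the original kernel, while a shape with 70000 attention heads or 2^32 elements would fall through to the general one.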

Correctness test

Re-tested with PaddleAPITest.
Added a unit test case to verify the results when mask and x have different shapes.
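A CPU reference is a common way to check such a case. The sketch below is not the PR's unit test. It assumes the operator computes softmax(x + mask) over the last axis, with x of shape [batches, attn_heads, query_seq_len, key_seq_len] and mask of shape [batches, 1, query_seq_len, key_seq_len] shared across heads. `SoftmaxMaskFuseReference` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Reference for out = softmax(x + mask) over the last axis, row-major layout.
// The mask row index omits the head dimension, so every head reuses one mask.
inline std::vector<float> SoftmaxMaskFuseReference(
    const std::vector<float>& x, const std::vector<float>& mask,
    int64_t batches, int64_t heads, int64_t q_len, int64_t k_len) {
  assert(static_cast<int64_t>(x.size()) == batches * heads * q_len * k_len);
  assert(static_cast<int64_t>(mask.size()) == batches * q_len * k_len);
  std::vector<float> out(x.size());
  for (int64_t b = 0; b < batches; ++b)
    for (int64_t h = 0; h < heads; ++h)
      for (int64_t i = 0; i < q_len; ++i) {
        const int64_t x_row = ((b * heads + h) * q_len + i) * k_len;
        const int64_t m_row = (b * q_len + i) * k_len;
        // Subtract the row maximum before exponentiating for stability.
        float row_max = -std::numeric_limits<float>::infinity();
        for (int64_t j = 0; j < k_len; ++j)
          row_max = std::fmax(row_max, x[x_row + j] + mask[m_row + j]);
        float sum = 0.0f;
        for (int64_t j = 0; j < k_len; ++j) {
          out[x_row + j] = std::exp(x[x_row + j] + mask[m_row + j] - row_max);
          sum += out[x_row + j];
        }
        for (int64_t j = 0; j < k_len; ++j) out[x_row + j] /= sum;
      }
  return out;
}
```

A GPU result can then be compared element by element against this reference within a tolerance suited to the data type.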

pcard-67164

paddle-bot · 2025-06-04T09:38:05Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

DanielSun11 · 2025-06-04T16:20:22Z

/re-run all-failed

DanielSun11 · 2025-06-08T16:37:18Z

/re-run all-failed

wanghuancoder

LGTM

…dle#73096) * fix fused_softmax_mask and its grad for big tensor * fix forward kernel block config * using v1 and v2 kernel to keep the performance * fix data type for windows

DanielSun11 added 2 commits June 3, 2025 20:58

fix fused_softmax_mask and its grad for big tensor

631f44c

fix forward kernel block config

a4c26e4

DanielSun11 and others added 2 commits June 5, 2025 19:14

Merge branch 'PaddlePaddle:develop' into fix_fused_softmax_mask

eda9335

using v1 and v2 kernel to keep the performance

68cb7e9

fix data type for windows

7306a99

lshpku approved these changes Jun 12, 2025

wanghuancoder approved these changes Jun 12, 2025

lshpku merged commit d637d5b into PaddlePaddle:develop Jun 12, 2025
49 of 50 checks passed
