[Hackathon 7th No.32] Enhance paddle.nn.functional.scaled_dot_product_attention #69099
Conversation
Your PR has been submitted successfully. Thank you for contributing to this open-source project!
OK, thanks! I have looked at Paddle's backend selection; the current code modifies Paddle's backend selection and also adjusts the mask handling.
Do I need to strictly align with torch here, or is it enough for the backends that paddle flash_attention already supports to keep working under Paddle's sdpa? Reference: torch sdpa code
Sorry to inform you that 2f261aa's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
So we support three of torch's backends, Flash (GPU), Efficient, and Math, and the remaining CUDNN and Override are not supported, right? The backends Paddle already supports need to be aligned with torch, and backends Paddle does not currently have can be left out.
Yes, got it. Thanks!
The PR has been fixed, with the following added:
Could @zhwesky2010 please help review it again? Thanks!
By the way, here is the test code:

```python
# Compare Paddle's scaled_dot_product_attention against torch's under matching kernel backends.
import torch
from torch.backends.cuda import sdp_kernel as torch_sdp_kernel
import paddle
from paddle.nn.functional import sdp_kernel as paddle_sdp_kernel
import numpy as np
from numpy.testing import assert_allclose


def create_float_attention_mask(batch_size, seq_lens, dtype='float16'):
    max_seq_len = max(seq_lens)
    # Paddle mask with 1e4
    paddle_mask = paddle.zeros([batch_size, 1, max_seq_len, max_seq_len], dtype=dtype)
    for i in range(batch_size):
        seq_len = seq_lens[i]
        mask = paddle.tril(paddle.ones(shape=(seq_len, seq_len), dtype=dtype)) - 1
        paddle_mask[i, 0, :seq_len, :seq_len] = mask * 1e4
    # Torch mask with 1e9
    torch_mask = paddle.zeros([batch_size, 1, max_seq_len, max_seq_len], dtype=dtype)
    for i in range(batch_size):
        seq_len = seq_lens[i]
        mask = paddle.tril(paddle.ones(shape=(seq_len, seq_len), dtype=dtype)) - 1
        torch_mask[i, 0, :seq_len, :seq_len] = mask * 1e9
    torch_mask = torch_mask.numpy()
    torch_mask = torch.from_numpy(torch_mask).to(dtype=torch.float16).cuda()
    return paddle_mask, torch_mask


def test_sdpa_alignment(batch_size=2, seq_len=1024, num_heads=8, head_dim=64):
    np.random.seed(42)
    torch.manual_seed(42)
    paddle.seed(42)
    query_np = np.random.randn(batch_size, seq_len, num_heads, head_dim).astype(np.float16)
    key_np = np.random.randn(batch_size, seq_len, num_heads, head_dim).astype(np.float16)
    value_np = np.random.randn(batch_size, seq_len, num_heads, head_dim).astype(np.float16)
    paddle_mask, torch_mask = create_float_attention_mask(batch_size, [seq_len] * batch_size, dtype='float16')
    query_paddle = paddle.to_tensor(query_np)
    key_paddle = paddle.to_tensor(key_np)
    value_paddle = paddle.to_tensor(value_np)
    # torch expects [batch, num_heads, seq_len, head_dim], hence the transpose.
    query_torch = torch.from_numpy(query_np.transpose(0, 2, 1, 3)).cuda()
    key_torch = torch.from_numpy(key_np.transpose(0, 2, 1, 3)).cuda()
    value_torch = torch.from_numpy(value_np.transpose(0, 2, 1, 3)).cuda()

    # Flash Attention
    with paddle_sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False), \
         torch_sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
        try:
            paddle_flash = paddle.nn.functional.scaled_dot_product_attention(
                query_paddle, key_paddle, value_paddle,
                attn_mask=paddle_mask,
                dropout_p=0.0,
                is_causal=False
            ).numpy()
            torch_flash = torch.nn.functional.scaled_dot_product_attention(
                query_torch, key_torch, value_torch,
                attn_mask=None,
                dropout_p=0.0,
                is_causal=True
            ).cpu().numpy().transpose(0, 2, 1, 3)
            print("\nComparing Flash Attention:")
            abs_diff = np.abs(paddle_flash - torch_flash)
            print(f"Maximum absolute difference: {np.max(abs_diff)}")
            print(f"Mean absolute difference: {np.mean(abs_diff)}")
            assert_allclose(paddle_flash, torch_flash, rtol=1e-2, atol=1e-2)
            print("Flash Attention results match within tolerance")
        except Exception as e:
            print(f"Flash Attention comparison failed: {e}")

    # Math Attention
    with paddle_sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False), \
         torch_sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
        try:
            paddle_math = paddle.nn.functional.scaled_dot_product_attention(
                query_paddle, key_paddle, value_paddle,
                attn_mask=paddle_mask,
                dropout_p=0.0,
                is_causal=False
            ).numpy()
            torch_math = torch.nn.functional.scaled_dot_product_attention(
                query_torch, key_torch, value_torch,
                attn_mask=torch_mask,
                dropout_p=0.0,
                is_causal=False
            ).cpu().numpy().transpose(0, 2, 1, 3)
            print("\nComparing Math Attention:")
            abs_diff = np.abs(paddle_math - torch_math)
            print(f"Maximum absolute difference: {np.max(abs_diff)}")
            print(f"Mean absolute difference: {np.mean(abs_diff)}")
            assert_allclose(paddle_math, torch_math, rtol=1e-2, atol=1e-2)
            print("Math Attention results match within tolerance")
        except Exception as e:
            print(f"Math Attention comparison failed: {e}")

    # Memory Efficient Attention
    with paddle_sdp_kernel(enable_flash=False, enable_math=False, enable_mem_efficient=True), \
         torch_sdp_kernel(enable_flash=False, enable_math=False, enable_mem_efficient=True):
        try:
            paddle_mem = paddle.nn.functional.scaled_dot_product_attention(
                query_paddle, key_paddle, value_paddle,
                attn_mask=paddle_mask,
                dropout_p=0.0,
                is_causal=False
            ).numpy()
            torch_mem = torch.nn.functional.scaled_dot_product_attention(
                query_torch, key_torch, value_torch,
                attn_mask=torch_mask,
                dropout_p=0.0,
                is_causal=False
            ).cpu().numpy().transpose(0, 2, 1, 3)
            print("\nComparing Memory Efficient Attention:")
            abs_diff = np.abs(paddle_mem - torch_mem)
            print(f"Maximum absolute difference: {np.max(abs_diff)}")
            print(f"Mean absolute difference: {np.mean(abs_diff)}")
            assert_allclose(paddle_mem, torch_mem, rtol=1e-2, atol=1e-2)
            print("Memory Efficient Attention results match within tolerance")
        except Exception as e:
            print(f"Memory Efficient Attention comparison failed: {e}")


if __name__ == "__main__":
    for seq_len in [512]:
        print(f"\nTesting with sequence length: {seq_len}")
        test_sdpa_alignment(
            batch_size=2,
            seq_len=seq_len,
            num_heads=8,
            head_dim=64
        )
```

cc: @zhwesky2010
Regarding the transpose(0, 2, 1, 3) applied to the returned result: is that because the tensor layouts differ? Is this diff reasonable?
Diff context:

```python
if "xpu" in place:
if place == "XPU":
```
While running the XPU CI... I found that on XPU, a tensor's .place can return the string "XPU".
Diff context:

```diff
@@ -318,7 +374,7 @@ def flash_attention(
     """
     head_dim = query.shape[3]
-    sdp_func_name = _select_sdp(head_dim)
+    sdp_func_name = _select_sdp(head_dim, query.dtype, query.place)
```
query.place
Its possible values aren't the strings XPU, CPU, and GPU, are they? How does the check above get hit?
The "XPU"/"CPU"/"GPU" strings are a safeguard: there is a risk that a string-valued variable ends up in _select_sdp, and this actually happens when running CI. The expected form is of course something that .is_gpu_place() can handle.
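To make the safeguard concrete, here is a minimal sketch of the kind of defensive check being described; `_is_gpu_place_like` is a hypothetical helper for illustration, not the PR's actual code:

```python
def _is_gpu_place_like(place) -> bool:
    # `place` is normally a Paddle Place object exposing .is_gpu_place(), but in
    # some CI environments (e.g. XPU) it can arrive as a plain string like "XPU".
    if isinstance(place, str):
        return "gpu" in place.lower()
    return place.is_gpu_place()


# Both forms are handled:
print(_is_gpu_place_like("XPU"))  # False
print(_is_gpu_place_like("GPU"))  # True
```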
return "flash_attn" | ||
|
||
# not use sdp_kernel | ||
if g_enable_flash is None: | ||
if "gpu" not in place: | ||
arch = _get_arch_info() |
_get_arch_info is only needed in the GPU case, right? This could be moved further down.
OK, updated.
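A minimal sketch of the reordering being suggested, assuming the names from the diff context above (`_get_arch_info`, `g_enable_flash`); the stub and the SM-80 threshold are illustrative, not Paddle's actual implementation:

```python
def _get_arch_info():
    # Stand-in for the real compute-capability query named in the diff above.
    return 80


def _select_backend_sketch(place, g_enable_flash=None):
    # When the user has not forced a kernel via sdp_kernel(...), fall back to
    # automatic selection.
    if g_enable_flash is None:
        if "gpu" not in str(place).lower():
            return "math"            # non-GPU: no architecture check needed
        arch = _get_arch_info()      # GPU only: architecture matters here
        return "flash_attn" if arch >= 80 else "mem_efficient"
    return "flash_attn"


print(_select_backend_sketch("Place(gpu:0)"))  # flash_attn
print(_select_backend_sketch("Place(cpu)"))    # math
```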
Diff context:

```python
    # handle bfloat16/fp16 case
    elif place.is_gpu_place():
        if dtype == paddle.bfloat16 or dtype == paddle.float16:
```
Has the logic of these branches been aligned with torch? Is it consistent?
This part isn't aligned yet; I'll revise it with reference to torch's approach.
Yes, this is reasonable. Paddle's mem efficient path also needs a transpose. In PyTorch, flash attention accepts inputs of shape [bs, num_head, seq_length, head_dim]. In phi, Paddle's flash attention should transpose internally, but the math and mem efficient kernels both expect the [bs, num_head, seq_length, head_dim] layout, so those two are transposed manually.
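To illustrate the layout difference described above (shapes match the test script earlier in the thread; this is just a sketch of the transpose, not the kernel code):

```python
import numpy as np

batch, seq_len, num_heads, head_dim = 2, 512, 8, 64

# Paddle's public scaled_dot_product_attention takes [batch, seq_len, num_heads, head_dim].
x_api_layout = np.random.randn(batch, seq_len, num_heads, head_dim).astype(np.float16)

# The torch-style / math / mem-efficient kernels expect [batch, num_heads, seq_len, head_dim],
# hence the transpose(0, 2, 1, 3) in the comparison script.
x_kernel_layout = x_api_layout.transpose(0, 2, 1, 3)

print(x_api_layout.shape)     # (2, 512, 8, 64)
print(x_kernel_layout.shape)  # (2, 8, 512, 64)
```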
PyTorch still passes; let me take another screenshot. As for the second item, let me think about how to handle it.
@zhwesky2010 the test has been added.
Diff context:

```python
    # not use sdp_kernel
    if (
        g_enable_flash is None
```
Where are these three variables set? It seems users have no way to specify them?
There is a way:

```python
from paddle.nn import sdp_kernel

with sdp_kernel(
    enable_flash=False, enable_math=True, enable_mem_efficient=False
):
    ...  # call scaled_dot_product_attention here; only the math kernel is enabled
```
> Where are these three variables set? It seems users have no way to specify them?

See the last test; that is exactly how it is done. You can either manually control which kernel is enabled, or leave the choice to the default backend selection.
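For readers wondering how a context manager can drive those module-level flags, here is a minimal illustrative sketch; the flag names follow the diff context above, but the implementation shown is not Paddle's actual code:

```python
from contextlib import contextmanager

# Module-level flags consulted by the backend-selection logic
# (names follow the diff context above).
g_enable_flash = None
g_enable_math = None
g_enable_mem_efficient = None


@contextmanager
def sdp_kernel_sketch(enable_flash=True, enable_math=True, enable_mem_efficient=True):
    """Illustrative stand-in for sdp_kernel: set the flags, restore them on exit."""
    global g_enable_flash, g_enable_math, g_enable_mem_efficient
    prev = (g_enable_flash, g_enable_math, g_enable_mem_efficient)
    g_enable_flash, g_enable_math, g_enable_mem_efficient = (
        enable_flash,
        enable_math,
        enable_mem_efficient,
    )
    try:
        yield
    finally:
        g_enable_flash, g_enable_math, g_enable_mem_efficient = prev


with sdp_kernel_sketch(enable_flash=False, enable_math=True, enable_mem_efficient=False):
    print(g_enable_flash, g_enable_math, g_enable_mem_efficient)  # False True False
print(g_enable_flash, g_enable_math, g_enable_mem_efficient)      # None None None
```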
LGTM
@yinfan98 please look into why CI is failing, and don't introduce any compatibility risk. The scenarios that previously used the flash_attn backend have not changed, right? Otherwise performance could change.
@zhwesky2010 No change: flash_attn still uses the previous code path, and the backend selection is the same as before.
Then let's merge it.
LGTM
@zhwesky2010 The coverage check failed and I'm not sure how to fix it; it doesn't look like it was caused by my changes...
LGTM
@yinfan98 please fix the Windows pipeline.
I've hit the rerun limit, so let me just rerun everything...
PR Category
User Experience
PR Types
Improvements
Description
Modify the Paddle sdpa code to support selecting among the math, mem efficient, and flash backends, and align the selection logic with the way torch does it.
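A usage sketch of the enhancement under the two modes described above (shapes and dtype are illustrative; the import path follows the test script in this thread and assumes a GPU build, since float16 attention generally requires one):

```python
import paddle
import paddle.nn.functional as F
from paddle.nn.functional import sdp_kernel  # path as used in the test script above

q = paddle.randn([2, 512, 8, 64]).astype('float16')
k = paddle.randn([2, 512, 8, 64]).astype('float16')
v = paddle.randn([2, 512, 8, 64]).astype('float16')

# Default: let the backend pick among flash / mem_efficient / math.
out_auto = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Manual: restrict the choice to the math kernel only.
with sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
    out_math = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out_auto.shape, out_math.shape)  # both [2, 512, 8, 64]
```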