[CINN] Backend supports the Welford variance algorithm #71057

lshpku · 2025-02-08T10:26:54Z

PR Category

CINN

PR Types

Improvements

Description

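Implements variance computation via the Welford algorithm in the CINN backend, corresponding to the frontend pd_op.variance operator.

Note: the variance operator can currently be invoked only through the composite (decomposed) BatchNorm operator; calling it directly from the frontend is not supported.

Example program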
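
For readers unfamiliar with the technique, here is a minimal pure-Python sketch of the textbook single-pass Welford update. It is not the CINN implementation; the function name `welford_variance` is hypothetical, and it returns the population (biased) variance, which matches the `paddle.mean((x - mean_x) ** 2)` reference used in the example program.

```python
def welford_variance(values):
    """Return (mean, population variance) of `values` in a single pass.

    Illustrative sketch only; the name and signature are assumptions,
    not part of Paddle or CINN.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean          # deviation from the mean before the update
        mean += delta / count     # updated running mean
        m2 += delta * (x - mean)  # combines the old and the new deviation
    if count == 0:
        raise ValueError("variance of an empty sequence is undefined")
    return mean, m2 / count


mean, var = welford_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(mean, var)
```

Because the update only ever works with deviations from the running mean, it never subtracts two large, nearly equal quantities.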

import paddle
import numpy as np
batch_norm = paddle.nn.BatchNorm(64, data_layout='NHWC')
@paddle.jit.to_static(full_graph=True, backend='CINN')
def fn_batch_norm(x):
    return batch_norm(x)
x = paddle.randn([32, 28, 28, 64]) + 1e2
x.stop_gradient = False
welford_out = fn_batch_norm(x)  # uses the Welford algorithm
mean_x = paddle.mean(x, axis=(0, 1, 2))
var_x = paddle.mean((x - mean_x) ** 2, axis=(0, 1, 2))
two_pass_out = (x - mean_x) * paddle.rsqrt(var_x + batch_norm._epsilon)  # the two-pass result is generally treated as the gold standard
var_x = paddle.mean(x * x, axis=(0, 1, 2)) - mean_x ** 2
one_pass_out = (x - mean_x) * paddle.rsqrt(var_x + batch_norm._epsilon)  # CINN's original one-pass algorithm
np.testing.assert_allclose(two_pass_out.numpy(), welford_out.numpy(), rtol=1e-3, atol=1e-3)  # passes
np.testing.assert_allclose(two_pass_out.numpy(), one_pass_out.numpy(), rtol=1e-3, atol=1e-3)  # fails!
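
The one-pass formula fails above because mean(x * x) and mean(x) ** 2 are large and nearly equal, so their difference loses most of its significant digits. The sketch below reproduces the same effect in pure Python. It is an analogy rather than a reproduction of the float32 case: it assumes float64 values with a much larger offset (1e9 instead of 1e2), and the helper names are hypothetical.

```python
def two_pass_variance(values):
    """Population variance: compute the mean first, then average squared deviations."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def one_pass_variance(values):
    """Population variance as E[x^2] - E[x]^2, which is prone to cancellation."""
    n = len(values)
    mean = sum(values) / n
    mean_sq = sum(x * x for x in values) / n
    return mean_sq - mean * mean


# Assumed illustrative data: small spread on top of a large offset.
data = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]

print("two-pass:", two_pass_variance(data))
print("one-pass:", one_pass_variance(data))
```

The squares here are close to 1e18, where adjacent float64 values are 128 apart, so the one-pass result cannot resolve a true variance of 22.5.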

Generated CUDA code:

__global__
void __launch_bounds__(512) fn_variance_kernel(
  const float* __restrict__ var,
  float* __restrict__ var_0,
  float* __restrict__ var_1,
  float* __restrict__ var_0_rf,
  welford_fp32* __restrict__ var_1_rf,
  int32_t* __restrict__ semaphore)
{
  float var_0_rf [ 1 ];
  welford_fp32 var_1_rf [ 1 ];
  __shared__ welford_fp32 shm32__welford_fp32_reduce [ 512 ];
  __shared__ float shm32__fp32_reduce [ 512 ];
  bool is_last_block_done [ 1 ];
  var_1_rf[0] = welford_fp32(0.00000000f, 0.00000000f, 0.00000000f);
  var_0_rf[0] = 0.00000000f;
  for (int32_t k = 0; k < 49; k += 1) {
    float var_local = var[(((k * 32 + blockIdx.y) * 16 + threadIdx.y) * 2 + blockIdx.x) * 32 + threadIdx.x];
    var_1_rf[0] = var_1_rf[0] + (welford_fp32)var_local;  // computes variance(x)
    var_0_rf[0] = var_0_rf[0] + var_local;                // computes sum(x), which is really used to obtain mean(x)
  }
  var_0_rf[(blockIdx.y * 2 + blockIdx.x) * 32 + threadIdx.x] = cinn_discrete_reduce_sum_fp32(var_0_rf[0], shm32__fp32_reduce);
  var_1_rf[(blockIdx.y * 2 + blockIdx.x) * 32 + threadIdx.x] = cinn_discrete_reduce_sum_welford_fp32(var_1_rf[0], shm32__welford_fp32_reduce);
  is_last_block_done[0] = cinn_grid_reduce_update_semaphore(semaphore);
  if (is_last_block_done[0]) {
    var_0[blockIdx.x * 32 + threadIdx.x] = cinn_grid_reduce_sum_fp32(var_0_rf, 64, ((blockIdx.x * 32) + threadIdx.x));
    var_1[blockIdx.x * 32 + threadIdx.x] = (float)cinn_grid_reduce_sum_welford_fp32(var_1_rf, 64, ((blockIdx.x * 32) + threadIdx.x));
  }
}
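
The kernel above combines welford_fp32 values with `+`, both per element and across threads and blocks. The internals of welford_fp32 are not shown in this PR, so the following is only a sketch of the standard pairwise merge for Welford states (the parallel variance formula), written in Python. The class name `WelfordState` and the field layout (mean, m2, count) are assumptions and may not match the order of the three fields in welford_fp32.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class WelfordState:
    """Hypothetical stand-in for a Welford accumulator; field order is assumed."""
    mean: float = 0.0
    m2: float = 0.0     # sum of squared deviations from `mean`
    count: float = 0.0

    @classmethod
    def from_value(cls, x):
        # A single element has itself as the mean and zero spread.
        return cls(mean=x, m2=0.0, count=1.0)

    def __add__(self, other):
        # Merge two partial states without revisiting the underlying data.
        count = self.count + other.count
        if count == 0:
            return WelfordState()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / count
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / count
        return WelfordState(mean, m2, count)

    def variance(self):
        return self.m2 / self.count  # population variance


def reduce_chunk(chunk):
    """Fold one chunk of values into a single state, like one thread's loop."""
    return reduce(lambda acc, x: acc + WelfordState.from_value(x), chunk, WelfordState())


# Two chunks reduced independently, then merged, like a block-level reduction.
left = reduce_chunk([2.0, 4.0, 4.0, 4.0])
right = reduce_chunk([5.0, 5.0, 7.0, 9.0])
merged = left + right
print(merged.mean, merged.variance())
```

Since the merge is associative up to rounding, partial states can be combined in any tree order, which is what makes a shared-memory or grid-level reduction possible.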

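Performance concerns

Welford causes some performance regression!

For float32, the measured regression is below 0.5% for every shape tested, so it can be considered fairly safe.
For float16, the regression is between 1% and 3%, and it is more pronounced for smaller shapes: Welford requires more computation, and when the data volume is small the compute latency cannot be hidden.

That said, there is no need to worry too much. The numbers above are for BatchNorm in isolation, and BatchNorm usually accounts for no more than 30% of a model's run time. Even for ResNet50, where its share is especially large, testing showed no visible ips regression; only the nsys statistics show that the share of CINN kernels increased by 1%.

If higher performance is needed later, another pass could be written to fuse the variance and mean computations. Welford actually yields both the variance and the mean, but the mean is currently computed redundantly; fusing the two would save one register.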

Pcard-85711

paddle-bot · 2025-02-08T10:26:59Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

paddle-ci-bot · 2025-02-20T03:03:17Z

Sorry to inform you that 0a3ea59's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

zyfncg · 2025-02-27T11:20:34Z

test/prim/model/test_resnet_prim.py

+        np.testing.assert_allclose(
+            dy2st_prim, standard_prim, rtol=2e-2, atol=1e-2
+        )


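Why is the precision difference so large here?

This compares the loss after 10 training steps, so fairly large deviations are indeed expected. I compared dynamic graph, composite operators, and CINN, and each produced different numbers, so it is hard to say which one is the reference; the only option is to loosen the tolerance a bit.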
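
To make the loosened tolerance concrete: np.testing.assert_allclose accepts a value when abs(actual - desired) <= atol + rtol * abs(desired). The scalar sketch below applies that rule with the rtol=2e-2, atol=1e-2 from the diff above. The helper name and the loss values are hypothetical and are not taken from the test.

```python
def within_tolerance(actual, desired, rtol=2e-2, atol=1e-2):
    """Scalar version of the check performed by np.testing.assert_allclose."""
    return abs(actual - desired) <= atol + rtol * abs(desired)


# Hypothetical loss values, for illustration only.
reference_loss = 1.0
allowed = 1e-2 + 2e-2 * abs(reference_loss)  # absolute slack for this reference

print(allowed)
print(within_tolerance(1.02, reference_loss))
print(within_tolerance(1.05, reference_loss))
```

For a reference loss near 1.0, the check therefore allows an absolute deviation of about 0.03.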


lshpku force-pushed the backend-welford-reduce branch from 0a3ea59 to d730045 Compare February 24, 2025 11:11

lshpku changed the title ~~[CINN] Support pd_op.reduce_var with Welford algorithm in backend (demo)~~ [CINN] Backend supports the Welford variance algorithm Feb 24, 2025

lshpku force-pushed the backend-welford-reduce branch 2 times, most recently from 4e848f9 to b0203aa Compare February 26, 2025 15:44

[CINN] Backend supports the Welford variance algorithm

68ea52f

lshpku force-pushed the backend-welford-reduce branch from b0203aa to 68ea52f Compare February 27, 2025 05:18

zyfncg approved these changes Feb 27, 2025


lshpku merged commit c750413 into PaddlePaddle:develop Feb 27, 2025
32 checks passed

Enigmatisms pushed a commit to Enigmatisms/Paddle that referenced this pull request Mar 6, 2025

[CINN] Backend supports the Welford variance algorithm (PaddlePaddle#…

c3c305f

…71057)

Enigmatisms mentioned this pull request Apr 1, 2025

[CINN] Realize composite reduce type in optim pass #72010

Merged

YqGe585 pushed a commit to YqGe585/Paddle that referenced this pull request May 7, 2025

[CINN] Backend supports the Welford variance algorithm (PaddlePaddle#…

ff57531

…71057)
