[CINN] Add the TileTransposeTactic #70942

lshpku · 2025-01-22T05:41:57Z

PR Category

CINN

PR Types

Performance

Description

Implements a basic version of TileTransposeTactic. It currently supports any number of aligned Elementwise+Transpose ops.

Limitations:

Reduce is not supported
Non-Elementwise TrivialOps (i.e. Reshape, Concat, Slice) are not supported
All Transposes in the subgraph must be aligned, i.e. they must have the same permutation

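Main changes in this PR

tile_transpose_tactic.cc: implementation of the transpose schedule template; the core of this PR
reindex_transpose_buffer_pass.cc: applies a swizzle transform to the shared memory access indices of each transpose to eliminate bank conflicts, and makes all transposes share one union buffer to avoid wasting shared memory
storage.cc: refactors the CacheRead primitive so that it can distinguish between different index-access instances of the same tensor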
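The effect of the swizzle on bank conflicts can be illustrated with a small model. This is a minimal sketch, not code from the PR: it assumes 32 shared memory banks of 4-byte words and a 32x32 float tile, and the helper names `max_conflict` and `column_read` are hypothetical. It counts how many distinct addresses from one warp fall into the same bank when 32 lanes read one tile column, with and without XOR-ing the column index by the row index.

```python
BANKS = 32  # assumption: 32 banks, one 4-byte word per bank per cycle


def max_conflict(addresses):
    """Largest number of distinct word addresses that map to a single bank."""
    per_bank = {}
    for addr in set(addresses):
        per_bank.setdefault(addr % BANKS, set()).add(addr)
    return max(len(addrs) for addrs in per_bank.values())


def column_read(col, swizzle):
    """Word addresses touched when lane i reads element (row=i, col) of a 32x32 tile."""
    return [lane * 32 + ((col ^ lane) if swizzle else col) for lane in range(32)]


if __name__ == "__main__":
    for col in (0, 5, 31):
        print(col, max_conflict(column_read(col, False)), max_conflict(column_read(col, True)))
```

Under these assumptions, the unswizzled column read places all 32 lanes in one bank, while the XOR form spreads them over 32 different banks, because `col ^ lane` takes a different value for each lane.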

Example

Input subgraph

x = paddle.randn([32, 128, 14, 14])
def func(x):
    x = x.transpose([0, 2, 3, 1])
    return x / (x + 1)

Generated CUDA code

__global__
void __launch_bounds__(256) fn_transpose_scale_divide_yield_store___kernel(
  const float* __restrict__ var /* 32,128,14,14 */,
  float* __restrict__ var_3 /* 32,14,14,128 */
) {
  __builtin_assume(blockIdx.x < 896);
  __builtin_assume(threadIdx.x < 32);
  __builtin_assume(threadIdx.y < 8);
  float _var_local_buffer [ 4 ];
  extern __shared__ uint8_t dyn_shared_buffer[];
  uint8_t *transpose_union_shm = (uint8_t*)&dyn_shared_buffer[ 0 ];
  float* var_shared_buffer = (float*)transpose_union_shm;
  float* var_local_buffer = _var_local_buffer;
  __syncthreads();
  for (int k = 0; k < 4; k += 1) {
    if (((((blockIdx.x % 28) / 4) * 32) + threadIdx.x) < 196) {
      float var_local = var[((((((k * 8) + threadIdx.y) + ((blockIdx.x & 3) * 32)) + ((blockIdx.x / 28) * 128)) * 196) + threadIdx.x) + (((blockIdx.x % 28) / 4) * 32)];
      var_shared_buffer[(((k * 8) + threadIdx.y) * 32) + cinn_nvgpu_bitwise_xor_int32(threadIdx.x, ((k * 8) + threadIdx.y))] = var_local;
    }
  }
  __syncthreads();
  for (int k = 0; k < 4; k += 1) {
    if (((((((blockIdx.x % 28) / 4) * 4) + k) * 8) + threadIdx.y) < 196) {
      var_local_buffer[k] = var_shared_buffer[(threadIdx.x * 32) + cinn_nvgpu_bitwise_xor_int32(((k * 8) + threadIdx.y), threadIdx.x)];
    }
  }
  for (int k = 0; k < 4; k += 1) {
    if (((((((blockIdx.x % 28) / 4) * 4) + k) * 8) + threadIdx.y) < 196) {
      var_3[((((((((blockIdx.x % 28) / 4) * 32) + (k * 8)) + threadIdx.y) + ((blockIdx.x / 28) * 196)) * 128) + ((blockIdx.x & 3) * 32)) + threadIdx.x] =
          (var_local_buffer[k] / (var_local_buffer[k] + 1.00000000f));
    }
  }
}
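The index arithmetic of the kernel above can be checked on the CPU. The following is a rough pure-Python emulation, not part of the PR: it assumes `cinn_nvgpu_bitwise_xor_int32` is a plain bitwise XOR (as its name suggests), runs the threads of each block sequentially, and uses the hypothetical names `emulate_kernel` and `reference`. The batch size is a parameter so that a small case can be compared against a direct transpose followed by `x / (x + 1)`.

```python
C, HW = 128, 196  # channels and flattened 14*14 spatial extent from the example
TILE = 32         # tile edge, also the threadIdx.x extent


def emulate_kernel(var, n_batches):
    """Sequential emulation of the generated kernel for input [n_batches, 128, 14, 14]."""
    out = [None] * (n_batches * HW * C)
    for block in range(n_batches * 28):
        n, hw_tile, c_tile = block // 28, (block % 28) // 4, block & 3
        shm = [None] * (TILE * TILE)
        # Phase 1: read global memory along HW, write shared memory with swizzled columns.
        for k in range(4):
            for ty in range(8):
                for tx in range(TILE):
                    if hw_tile * 32 + tx < HW:
                        row = k * 8 + ty
                        src = (row + c_tile * 32 + n * C) * HW + tx + hw_tile * 32
                        shm[row * 32 + (tx ^ row)] = var[src]
        # Phase 2 (after __syncthreads): read the tile transposed, write along C.
        for k in range(4):
            for ty in range(8):
                for tx in range(TILE):
                    col = k * 8 + ty
                    if hw_tile * 32 + col < HW:
                        x = shm[tx * 32 + (col ^ tx)]
                        dst = (hw_tile * 32 + col + n * HW) * C + c_tile * 32 + tx
                        out[dst] = x / (x + 1.0)
    return out


def reference(var, n_batches):
    """Direct NCHW -> NHWC transpose followed by x / (x + 1)."""
    result = []
    for n in range(n_batches):
        for hw in range(HW):
            for c in range(C):
                x = var[(n * C + c) * HW + hw]
                result.append(x / (x + 1.0))
    return result


if __name__ == "__main__":
    data = [float(i) for i in range(2 * C * HW)]
    print(emulate_kernel(data, 2) == reference(data, 2))
```

The emulation also shows why the two guards differ: the load is guarded on `threadIdx.x` because HW is the fast axis when reading, while the store is guarded on `k * 8 + threadIdx.y` because HW becomes the slow axis after the transpose.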

Pcard-85711

paddle-bot · 2025-01-22T09:29:32Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

lshpku · 2025-02-06T03:28:05Z

paddle/cinn/ir/group_schedule/config/group_tile_config.cc

-            {bucket_info__1024_1M, tile_config__1024_1M},
-            {bucket_info__1M_INF, tile_config__1M_INF}};
+            {bucket_info__1024_INF, tile_config__1024_INF}};


These two buckets were duplicates, and surprisingly nobody had noticed. They caused one unit test to time out, so I merged them.

zyfncg · 2025-02-06T08:30:28Z

paddle/cinn/runtime/flags.cc

+PD_DEFINE_bool(cinn_enable_tile_transpose,
+               BoolFromEnv("FLAGS_cinn_enable_tile_transpose", true),
+               "Whether to enable the tile transpose tactic.");


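Could these tactic flags be merged into a single one, in a form like FLAGS_cinn_disable_tactic="tile_broadcast;tile_transpose"?

lshpku: That can be done in a separate PR. These flags are rarely used; I am the only one who uses them, when comparing performance. At that point I will simply skip the disabled tactics when creating the tactics in dy_shape_group_scheduler.cc.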
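The proposed flag format could be parsed roughly as follows. This is only an illustration of the suggestion: `FLAGS_cinn_disable_tactic` is the reviewer's proposal rather than an existing flag, and the function `enabled_tactics` and the tactic name `other_tactic` are hypothetical.

```python
def enabled_tactics(all_tactics, disable_flag):
    """Drop every tactic whose name appears in a ';'-separated disable list."""
    disabled = {name.strip() for name in disable_flag.split(";") if name.strip()}
    return [tactic for tactic in all_tactics if tactic not in disabled]


if __name__ == "__main__":
    tactics = ["tile_broadcast", "tile_transpose", "other_tactic"]
    print(enabled_tactics(tactics, "tile_broadcast;tile_transpose"))
```

An empty flag value leaves every tactic enabled, which would match a default of "nothing disabled".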

zyfncg · 2025-02-06T15:47:25Z

paddle/cinn/ir/group_schedule/tactic/tile_transpose_tactic.cc

+  sch->Split(shared_cache_block_id, offset + 1, {-1, 4, 8});
+  sch->Split(shared_cache_block_id, offset, {-1, 32});
+
+  sch->Split(local_cache_block_id, offset + 1, {-1, 32});
+  sch->Split(local_cache_block_id, offset, {-1, 4, 8});


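Can this handle the case where the transposed dimension is smaller than 32?

lshpku: Yes. When the extent is not evenly divisible, the Split primitive automatically adds an if guard. This has always been supported.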
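The guarded split described in this reply can be modeled in a few lines. This is a sketch of the general technique, not CINN's Split implementation; the function name `split_with_guard` is hypothetical, and it assumes the first factor is the inferred one (`-1`), as in `{-1, 4, 8}` and `{-1, 32}` above.

```python
from itertools import product


def split_with_guard(extent, factors):
    """Return the original indices visited by a split loop nest with an `if` guard."""
    inner_factors = list(factors[1:])
    inner_size = 1
    for f in inner_factors:
        inner_size *= f
    outer = -(-extent // inner_size)  # ceiling division gives the inferred factor
    visited = []
    for idx in product(range(outer), *(range(f) for f in inner_factors)):
        flat = 0
        for i, f in zip(idx, [outer] + inner_factors):
            flat = flat * f + i  # fuse the split loop variables back into one index
        if flat < extent:        # the guard added when the extent is not divisible
            visited.append(flat)
    return visited


if __name__ == "__main__":
    print(len(split_with_guard(196, [-1, 4, 8])), len(split_with_guard(14, [-1, 32])))
```

For an extent of 14 the inferred outer factor is 1, so a single tile of 32 is launched and the guard masks the 18 unused positions.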

zyfncg · 2025-02-06T15:53:03Z

paddle/cinn/ir/group_schedule/tactic/tile_transpose_tactic.cc

+  sch->Split(block_id, offset + 1, {-1, 32});
+  sch->Split(block_id, offset, {-1, 4, 8});
+
+  sch->Reorder(block_id, OffsetVec({0, 3, 1, 2, 4}, offset));


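Can dynamic dimensions, dimensions not divisible by 32, and dimensions smaller than 32 all be handled by this single unified flow?

lshpku: The flow is unified and does not distinguish between divisible and non-divisible extents or between static and dynamic shapes. When the extent is not divisible or the shape is dynamic, the primitives automatically add an if. The number of threads in x/y is fixed, but if not all of those threads are needed, an if restricts them.

zyfncg: How does it perform when the threads are not fully used?

lshpku: I tested it on NCHW->NHWC. Divisibility actually has little effect on performance. When the shape is small, kernel launch cost dominates, so whether the threads are fully used matters little. When the shape is large, the vast majority of blocks are fully used, which amortizes the partially used ones. Overall, in my tests static shapes reach over 90% bandwidth utilization, and dynamic shapes reach 60~80% because of the index computation cost, but in neither case is there a performance regression compared with before.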
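As a rough illustration of how much of a tiled axis is actually occupied, the fraction of guarded positions that do useful work can be computed directly. This sketch is not a performance model and says nothing about bandwidth; the helper name `tile_utilization` is hypothetical and a tile size of 32 is assumed.

```python
def tile_utilization(extent, tile=32):
    """Fraction of positions along one tiled axis that pass the bounds guard."""
    tiles = -(-extent // tile)  # number of tiles, rounded up
    return extent / (tiles * tile)


if __name__ == "__main__":
    for extent in (14, 128, 196):
        print(extent, tile_utilization(extent))
```

For the example in this PR, the H*W axis of extent 196 is covered by 7 tiles (224 positions), while the channel axis of extent 128 is covered exactly by 4 tiles.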

zyfncg · 2025-02-06T16:18:52Z

paddle/cinn/ir/ir_analyzer/ir_analyzer.cc

+    loop_vars.push_back(loop_var);
+
+    ir::Var new_loop_var = ir::ir_utils::IRCopy(loop_var);
+    new_loop_var->name = "loop_var_" + std::to_string(i);


The common prefix loop_var_ could be defined as a constant string, to avoid typos or missed updates once it is used in more places.

lshpku force-pushed the tile-transpose-tactic branch from 7dd8d89 to 55f1997 Compare January 22, 2025 08:53

lshpku force-pushed the tile-transpose-tactic branch 2 times, most recently from 3ab6558 to a312cee Compare January 26, 2025 03:32

lshpku force-pushed the tile-transpose-tactic branch 2 times, most recently from 1abbe94 to 6423521 Compare February 5, 2025 10:44

lshpku commented Feb 6, 2025

zyfncg reviewed Feb 6, 2025

lshpku force-pushed the tile-transpose-tactic branch from 6423521 to ad0024c Compare February 7, 2025 03:07

[CINN] Add the TileTransposeTactic

42c0cfe

lshpku force-pushed the tile-transpose-tactic branch from ad0024c to 42c0cfe Compare February 7, 2025 03:08

zyfncg approved these changes Feb 7, 2025

lshpku merged commit 1ecf5ee into PaddlePaddle:develop Feb 7, 2025
31 checks passed

YqGe585 pushed a commit to YqGe585/Paddle that referenced this pull request May 7, 2025

[CINN] Add the TileTransposeTactic (PaddlePaddle#70942)

bfc617b
