[CINN] Add the TileTransposeTactic #70942
Your PR was submitted successfully. Thank you for contributing to this open-source project!
{bucket_info__1024_1M, tile_config__1024_1M},
{bucket_info__1M_INF, tile_config__1M_INF}};
{bucket_info__1024_INF, tile_config__1024_INF}};
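These two buckets were duplicates, and surprisingly nobody had noticed. The duplication caused one unit test to time out, so I merged the two.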
PD_DEFINE_bool(cinn_enable_tile_transpose,
               BoolFromEnv("FLAGS_cinn_enable_tile_transpose", true),
               "Whether to enable the tile transpose tactic.");
Could these tactic flags be merged into a single flag, in a form like FLAGS_cinn_disable_tactic="tile_broadcast;tile_transpose"?
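That can be changed in a separate PR. These flags are rarely needed in general; I am the only one who uses them, when comparing performance. When I make that change, I will simply skip the disabled tactics at the point where the tactics are created in dy_shape_group_scheduler.cc.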
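As a rough illustration of the merged-flag idea discussed above, the sketch below parses a semicolon-separated value into a set of tactic names that a scheduler could consult before creating each tactic. The helper names `ParseDisabledTactics` and `IsTacticEnabled` are hypothetical, and `FLAGS_cinn_disable_tactic` is only the name proposed by the reviewer, not an existing flag.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: splits a value such as "tile_broadcast;tile_transpose"
// into a set of tactic names. Empty entries (e.g. from ";;") are ignored.
std::set<std::string> ParseDisabledTactics(const std::string& flag_value) {
  std::set<std::string> disabled;
  std::stringstream stream(flag_value);
  std::string name;
  while (std::getline(stream, name, ';')) {
    if (!name.empty()) disabled.insert(name);
  }
  return disabled;
}

// Hypothetical check that a scheduler could run before creating a tactic.
bool IsTacticEnabled(const std::set<std::string>& disabled,
                     const std::string& tactic_name) {
  return disabled.count(tactic_name) == 0;
}
```

With this shape, the tactic-creation code would only need one lookup per tactic instead of one boolean flag per tactic.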
sch->Split(shared_cache_block_id, offset + 1, {-1, 4, 8});
sch->Split(shared_cache_block_id, offset, {-1, 32});

sch->Split(local_cache_block_id, offset + 1, {-1, 32});
sch->Split(local_cache_block_id, offset, {-1, 4, 8});
Is the case where the transposed dimension is smaller than 32 handled correctly here?
Yes. When the extent is not evenly divisible, the Split primitive automatically adds an if guard. This has always been supported.
sch->Split(block_id, offset + 1, {-1, 32});
sch->Split(block_id, offset, {-1, 4, 8});

sch->Reorder(block_id, OffsetVec({0, 3, 1, 2, 4}, offset));
Can dynamic dimensions, dimensions not divisible by 32, and dimensions smaller than 32 all be handled by this single unified flow?
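The flow is unified: it does not distinguish between divisible and non-divisible extents, or between static and dynamic shapes. When the extent is not divisible or the shape is dynamic, the primitive automatically adds an if guard. In other words, the number of threads along x/y is fixed, and when not all of those threads are needed, an if restricts which ones do work.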
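The guarded split described above can be sketched in plain C++ (not CINN IR). The sketch assumes that a factor of -1 means "inferred", that is, the outer extent is the ceiling of extent / 32; the function name `VisitWithGuardedSplit` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Visits indices [0, extent) using a loop nest shaped like a split with
// factors {-1, inner_factor}. The inner extent is fixed, so iterations that
// fall outside the original range are masked by an if guard.
std::vector<int> VisitWithGuardedSplit(int extent, int inner_factor = 32) {
  assert(extent >= 0 && inner_factor > 0);
  // Inferred outer extent: round up so every original index is covered.
  const int outer_extent = (extent + inner_factor - 1) / inner_factor;
  std::vector<int> visited;
  for (int outer = 0; outer < outer_extent; ++outer) {
    for (int inner = 0; inner < inner_factor; ++inner) {
      const int index = outer * inner_factor + inner;
      if (index < extent) {  // guard for non-divisible or small extents
        visited.push_back(index);
      }
    }
  }
  return visited;
}
```

An extent of 20 yields one outer iteration in which 12 of the 32 inner iterations are masked out; an extent of 64 yields two outer iterations with no masked work.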
How does performance look when the threads are not fully used?
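I tested this with the NCHW->NHWC scenario. Divisibility actually has little effect on performance. When the shape is small, kernel launch cost dominates, so it matters little whether the threads are fully used. When the shape is large, the vast majority of blocks are fully used, which amortizes the partially used ones. In my tests, bandwidth utilization reaches over 90% for static shapes and 60~80% for dynamic shapes (lower because of the index computation cost), and in neither case does performance regress relative to before.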
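The amortization argument can be checked with simple arithmetic. The sketch below computes the fraction of launched threads that fall inside the data when a rows x cols matrix is covered by square tiles; the tile size of 32 is an assumption taken from the split factors in this PR, and `ThreadUtilization` is a hypothetical name. It is not a performance model: launch overhead and memory bandwidth are not considered.

```cpp
#include <cassert>

// Fraction of launched threads that do useful work when a rows x cols matrix
// is covered by tile x tile blocks and out-of-range threads are masked.
double ThreadUtilization(long rows, long cols, long tile = 32) {
  assert(rows > 0 && cols > 0 && tile > 0);
  // Round each dimension up to a whole number of tiles.
  const long padded_rows = (rows + tile - 1) / tile * tile;
  const long padded_cols = (cols + tile - 1) / tile * tile;
  return static_cast<double>(rows * cols) /
         static_cast<double>(padded_rows * padded_cols);
}
```

For a 33 x 33 matrix only about 27% of the launched threads are active, while for a 1000 x 1000 matrix about 95% are, which matches the observation that partially filled blocks matter less as the shape grows.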
loop_vars.push_back(loop_var);

ir::Var new_loop_var = ir::ir_utils::IRCopy(loop_var);
new_loop_var->name = "loop_var_" + std::to_string(i);
The common prefix loop_var_ could be defined as a constant string, to avoid typos and missed updates once it is used in more places.
done
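One possible shape for the constant-prefix suggestion above is sketched here. The names `kLoopVarPrefix` and `MakeLoopVarName` are hypothetical and are not taken from the PR; the actual change may look different.

```cpp
#include <cassert>
#include <string>

// Hypothetical constant holding the shared prefix, defined once so that every
// use site stays in sync if the prefix changes.
constexpr char kLoopVarPrefix[] = "loop_var_";

// Builds the name of the i-th loop variable from the shared prefix.
std::string MakeLoopVarName(int i) {
  assert(i >= 0);
  return kLoopVarPrefix + std::to_string(i);
}
```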
PR Category
CINN
PR Types
Performance
Description
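Implements a basic version of TileTransposeTactic. It currently supports any number of aligned Elementwise+Transpose operations.
Limitations:
Main contents of this PR
- tile_transpose_tactic.cc: implementation of the transpose schedule template; the core of this PR
- reindex_transpose_buffer_pass.cc: applies a swizzle transform to the shared memory access indices of the transposes to eliminate bank conflicts; makes all transposes share one union buffer to avoid wasting shared memory
- storage.cc: refactors the CacheRead primitive so that different indexed access instances of the same tensor can be distinguished
Example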
Pcard-85711
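
To illustrate why a swizzle on shared memory indices removes bank conflicts in a transpose, the sketch below simulates bank usage on the host in plain C++. It assumes 32 banks of 4-byte words and a 32 x 32 tile, and it uses an XOR swizzle (physical column = col ^ row), which is one common scheme; the description does not state which swizzle reindex_transpose_buffer_pass.cc actually applies, and all names here are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kTile = 32;   // assumed tile width, also the assumed warp size
constexpr int kBanks = 32;  // assumed number of 4-byte shared memory banks

// Word offset of logical element (row, col) inside the tile. With swizzling,
// the column is XOR-ed with the row so that a column spreads across banks.
int WordOffset(int row, int col, bool swizzle) {
  assert(row >= 0 && row < kTile && col >= 0 && col < kTile);
  const int physical_col = swizzle ? (col ^ row) : col;
  return row * kTile + physical_col;
}

// Worst-case number of lanes that hit the same bank when a warp reads one
// column of the tile (lane i reads row i), taken over all columns.
int MaxColumnConflict(bool swizzle) {
  int worst = 0;
  for (int col = 0; col < kTile; ++col) {
    std::array<int, kBanks> hits{};  // zero-initialized per-bank counters
    for (int lane = 0; lane < kTile; ++lane) {
      ++hits[WordOffset(lane, col, swizzle) % kBanks];
    }
    worst = std::max(worst, *std::max_element(hits.begin(), hits.end()));
  }
  return worst;
}
```

Under these assumptions, the unswizzled layout sends all 32 lanes of a column read to one bank, while the XOR layout gives each lane its own bank, because `col ^ lane` takes 32 distinct values as `lane` ranges over 0..31.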