Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.
PR Category
Performance Optimization
PR Types
New features
Description
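The performance of some operators, such as Conv, depends heavily on the layout of their input data. This PR optimizes these layout-sensitive operators by implementing transfer_layout_pass. The main goal of the pass is to ensure that every operator known to perform significantly better under a particular layout is converted to that target layout. On this basis, this PR has two main improvements and implementation points:
Note also that this pass assumes all inputs are initially NCHW. If an input is already NHWC, it may be rewritten incorrectly. See the Q&A section for details on this issue.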
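Approach
First, we describe how the problem is modeled. The computation graph contains three kinds of nodes: nodes that must run in NCHW, such as user inputs and outputs; nodes that must run in NHWC, such as Conv/FusedConv; and nodes that can accept any layout. The problem is then to color the third kind of node so that, globally, every node of the second kind runs in NHWC while the number of inserted transpose ops is minimized.
Naturally, we model this as a minimum cut problem. Each node of the first kind is connected to the source by an edge of infinite weight, and each node of the second kind is connected to the sink by an edge of infinite weight. The min-cut algorithm then partitions the nodes into two sets: the set containing the source, whose nodes will run in NCHW, and the set containing the sink, whose nodes will run in NHWC. The total weight of the cut edges corresponds to the number of transpose ops that will be inserted. The subtle part is making sure that, during graph construction, each edge weight correctly models the number of transpose ops inserted if the two endpoints of that edge end up in different layouts.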
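The modeling above can be illustrated with a small, self-contained sketch. It is not the pass's actual implementation: the function name `assign_layouts`, the sentinel names, the toy graph, and the choice of symmetric edge capacities (one transpose per tensor edge, whichever endpoint is NCHW) are all assumptions made for illustration. The max-flow routine is a plain BFS-based augmenting-path (Edmonds-Karp style) search, chosen only for brevity.

```python
from collections import defaultdict, deque

INF = float("inf")
SRC, SINK = "__source__", "__sink__"  # hypothetical sentinel node names


def assign_layouts(edges, nchw_only, nhwc_only):
    """Color nodes NCHW/NHWC via min cut.

    edges: iterable of (u, v, cost), where cost is the number of transpose
    ops needed if u and v end up in different layouts (assumed symmetric).
    Returns (layout_by_node, total_transpose_cost).
    """
    cap = defaultdict(lambda: defaultdict(float))

    def connect(u, v, c):
        # Symmetric capacity: the cost applies whichever side is NCHW.
        cap[u][v] += c
        cap[v][u] += c

    for u, v, cost in edges:
        connect(u, v, cost)
    for n in nchw_only:  # pin to the source side with an uncuttable edge
        connect(SRC, n, INF)
    for n in nhwc_only:  # pin to the sink side with an uncuttable edge
        connect(n, SINK, INF)

    def reachable():
        # BFS over edges with remaining capacity; returns the parent map.
        parent = {SRC: None}
        queue = deque([SRC])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    cut = 0
    parent = reachable()
    while SINK in parent:
        # Walk back from the sink to recover the augmenting path.
        path, v = [], SINK
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][v] for u, v in path)
        if flow == INF:
            raise ValueError("a node is pinned to both NCHW and NHWC")
        for u, v in path:
            cap[u][v] -= flow
            cap[v][u] += flow
        cut += flow
        parent = reachable()

    # Nodes still reachable from the source form the NCHW side of the cut.
    nodes = set(cap) - {SRC, SINK}
    return {n: "NCHW" if n in parent else "NHWC" for n in nodes}, cut


# Toy graph: input -> relu -> conv -> add -> output, plus a skip input -> add.
toy_edges = [("input", "relu", 1), ("relu", "conv", 1), ("conv", "add", 1),
             ("input", "add", 1), ("add", "output", 1)]
layouts, num_transposes = assign_layouts(
    toy_edges, nchw_only={"input", "output"}, nhwc_only={"conv"})
```

In this toy graph, `conv` is forced to NHWC, while `add` stays in NCHW because moving it to NHWC would cut two edges (`input`-`add` and `add`-`output`) instead of one. A real pass would additionally need edge weights that reflect how many transposes a given tensor actually requires, which is the subtle point noted above.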
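Performance testing
On an A30, the average latency of the SD1.5 model dropped from 3517.366 ms to 2561.959 ms (about a 27% reduction, or roughly a 1.37x speedup). More experiments will be added later.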
Q&A
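The InferMeta of many operators does not set the output layout correctly, so the original layout we obtain is not reliable. In addition, for operators such as matmul, whose inputs are usually two-dimensional, describing the input as either NCHW or NHWC is not meaningful. We therefore never change the layout of such operators.
The layout of reshape-like operators generally cannot be converted. However, cases such as 1 -> 1x1x1x32, which appears in UNet, can be converted, saving one transpose. This optimization will be supported in a separate follow-up PR.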
Others
Pcard-71500