[inductor] don't fuse two nodes if likely increase peak memory #138756
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138756
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 2c40d7e with merge base e6ff07f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…ak memory" Partially fixing #138685 Add a (relatively safe?) heuristics to skip fusion if we can potentially increasing peak memory. The doc string mainly explains what this PR is doing: ``` The implementation is more like a heuristic since we don't really know if we are at peak or not when trying to fuse these two ndoes. The order of nodes may change later which makes the peak memory estimation hard. Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes: 1. find all buffers read by each node with a single user. These buffers are supposed to be reused if we don't fuses these 2 nodes 2. find the intersection of these buffers for the two node and sum the total buffer size. If we don't fuse these two nodes, we can at lease avoid this much memory allocation. Note that the extra memory allocation is not necessarily causing peak memory increase. This is just a heuristic. We return true only if the saving for fusion can not trade off the extra memory allocation. ``` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]
…mory" Partially fixing #138685 Add a (relatively safe?) heuristics to skip fusion if we can potentially increasing peak memory. The doc string mainly explains what this PR is doing: ``` The implementation is more like a heuristic since we don't really know if we are at peak or not when trying to fuse these two ndoes. The order of nodes may change later which makes the peak memory estimation hard. Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes: 1. find all buffers read by each node with a single user. These buffers are supposed to be reused if we don't fuses these 2 nodes 2. find the intersection of these buffers for the two node and sum the total buffer size. If we don't fuse these two nodes, we can at lease avoid this much memory allocation. Note that the extra memory allocation is not necessarily causing peak memory increase. This is just a heuristic. We return true only if the saving for fusion can not trade off the extra memory allocation. ``` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failing tests
Does this impact compile time at all?
We have seen other cases of Inductor increasing memory use recently; see internal link. I would rather we do the full solution, i.e., with tensor liveness ranges, peak memory calculation, etc.
I imagine the full solution can be INCREMENTALLY built upon this one.
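(For context, the liveness-range peak-memory estimate alluded to here can be sketched in a few lines. The interval representation below is hypothetical, not Inductor's data model: each buffer is live from the step that allocates it to the last step that reads it.)

```python
from typing import Dict, List, Tuple


def peak_memory(liveness: Dict[str, Tuple[int, int]], sizes: Dict[str, int]) -> int:
    # liveness maps buffer name -> (alloc_step, last_use_step); sizes maps name -> bytes.
    events: List[Tuple[int, int]] = []
    for name, (start, end) in liveness.items():
        events.append((start, sizes[name]))      # allocation
        events.append((end + 1, -sizes[name]))   # free after last use
    live = peak = 0
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak


# Example: two buffers overlapping at step 1 give a peak of 300 bytes.
print(peak_memory({"a": (0, 1), "b": (1, 2)}, {"a": 100, "b": 200}))  # 300
```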
Is that compile-time regression in CI real?
…mory" Partially fixing #138685 Add a (relatively safe?) heuristics to skip fusion if we can potentially increasing peak memory. The doc string mainly explains what this PR is doing: ``` The implementation is more like a heuristic since we don't really know if we are at peak or not when trying to fuse these two ndoes. The order of nodes may change later which makes the peak memory estimation hard. Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes: 1. find all buffers read by each node with a single user. These buffers are supposed to be reused if we don't fuses these 2 nodes 2. find the intersection of these buffers for the two node and sum the total buffer size. If we don't fuse these two nodes, we can at lease avoid this much memory allocation. Note that the extra memory allocation is not necessarily causing peak memory increase. This is just a heuristic. We return true only if the saving for fusion can not trade off the extra memory allocation. ``` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]
|
…mory" Partially fixing #138685 Add a (relatively safe?) heuristics to skip fusion if we can potentially increasing peak memory. The doc string mainly explains what this PR is doing: ``` The implementation is more like a heuristic since we don't really know if we are at peak or not when trying to fuse these two ndoes. The order of nodes may change later which makes the peak memory estimation hard. Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes: 1. find all buffers read by each node with a single user. These buffers are supposed to be reused if we don't fuses these 2 nodes 2. find the intersection of these buffers for the two node and sum the total buffer size. If we don't fuse these two nodes, we can at lease avoid this much memory allocation. Note that the extra memory allocation is not necessarily causing peak memory increase. This is just a heuristic. We return true only if the saving for fusion can not trade off the extra memory allocation. ``` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]
…mory" Partially fixing #138685 Add a (relatively safe?) heuristics to skip fusion if we can potentially increasing peak memory. The doc string mainly explains what this PR is doing: ``` The implementation is more like a heuristic since we don't really know if we are at peak or not when trying to fuse these two ndoes. The order of nodes may change later which makes the peak memory estimation hard. Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes: 1. find all buffers read by each node with a single user. These buffers are supposed to be reused if we don't fuses these 2 nodes 2. find the intersection of these buffers for the two node and sum the total buffer size. If we don't fuse these two nodes, we can at lease avoid this much memory allocation. Note that the extra memory allocation is not necessarily causing peak memory increase. This is just a heuristic. We return true only if the saving for fusion can not trade off the extra memory allocation. ``` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]
…mory" Partially fixing #138685 Add a (relatively safe?) heuristics to skip fusion if we can potentially increasing peak memory. The doc string mainly explains what this PR is doing: ``` The implementation is more like a heuristic since we don't really know if we are at peak or not when trying to fuse these two ndoes. The order of nodes may change later which makes the peak memory estimation hard. Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes: 1. find all buffers read by each node with a single user. These buffers are supposed to be reused if we don't fuses these 2 nodes 2. find the intersection of these buffers for the two node and sum the total buffer size. If we don't fuse these two nodes, we can at lease avoid this much memory allocation. Note that the extra memory allocation is not necessarily causing peak memory increase. This is just a heuristic. We return true only if the saving for fusion can not trade off the extra memory allocation. ``` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@shunting314 curious if the error went away by itself after the rebase or if you had to change something in the code.
@laithsakka the error went away by itself after rebasing.
PASS: benchmark ('add_loop_inductor', 'compile_time_instruction_count') pass, actual result 24603225123 (+1.41%) is within expected 24260000000 ±1.50%
PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 40744976754 (+0.90%) is within expected 40380000000 ±2.50%
PASS: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') pass, actual result 23331151962 (+1.40%) is within expected 23010000000 ±1.50%
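(The pass/fail logic behind these lines is a plain relative-tolerance comparison against the expected instruction count; the helper below is a hypothetical sketch, not the CI code, using the numbers from the first entry.)

```python
def within_expected(actual: int, expected: int, tolerance: float) -> bool:
    # Pass when the relative deviation from the expected count stays inside the band.
    return abs(actual - expected) / expected <= tolerance


print(within_expected(24603225123, 24260000000, 0.015))  # True: +1.41% <= 1.50%
```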
See comments at the end of #138756. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames [ghstack-poisoned]
See comments at the end of #138756. I am also refreshing all values. Pull Request resolved: #139703. Approved by: https://github.com/bobrenjc93
To collect a memory snapshot for a generated wrapper, run the wrapper with `--cuda-memory-snapshot`. E.g.
```
python /tmp/torchinductor_shunting/tmpyhtfwdlv/wp/cwpulanbieu4beruc6w5uc3podcs2x3rzdk5okftu37c4k3bnd4b.py --cuda-memory-snapshot
```
gives me:
![CUDA memory snapshot of the generated wrapper](https://github.com/user-attachments/assets/82edd2d6-df57-488e-a390-8fa5fc00ba5f)
Pull Request resolved: #138429. Approved by: https://github.com/eellison, https://github.com/jansel. ghstack dependencies: #139136, #138756
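(The wrapper flag itself is added in #138429; the standalone snippet below only illustrates the PyTorch memory-snapshot API such a flag can build on, and is not the generated wrapper's actual code.)

```python
import torch

# Start recording CUDA allocation/free events (with stack traces).
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run the workload whose memory behavior we want to inspect ...
x = torch.randn(1024, 1024, device="cuda")
y = x @ x

# Dump a snapshot that can be loaded at https://pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot("snapshot.pickle")

# Stop recording.
torch.cuda.memory._record_memory_history(enabled=None)
```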
Pull Request resolved: pytorch#138756. Approved by: https://github.com/jansel. ghstack dependencies: pytorch#139136
Stack from ghstack (oldest at bottom):
Partially fixing #138685
Add a (relatively safe?) heuristic to skip fusion if it could potentially increase peak memory.
The doc string (quoted in full earlier in the conversation) mainly explains what this PR is doing.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov