Add compiler bisector #131936

eellison · 2024-07-26T19:54:06Z

Stack from ghstack (oldest at bottom):

-> Add compiler bisector #131936

This is a utility to aid the torch.compile debugging. You provide a function that returns True on success, False on failure, or do something out of process and run bisect_helper good | bad.

The bisector will first go through backends - eager, aot_eager, aot_eager_decomp_partition, inductor to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it.

An example usage of how to hook it up for aot_eager_decomp_partition and decomposition subsystem is :

    from torch._inductor.bisect_helper import BisectionManager
    if op in CURRENT_DECOMPOSITION_TABLE:
        if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)):
            return NotImplemented

Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with TORCH_BISECT_BACKEND TORCH_BISECT_SUBSYSTEM and TORCH_BISECT_MAX.

We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts.

Fix for #126546

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec

[ghstack-poisoned]

pytorch-bot · 2024-07-26T19:54:09Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131936

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9235892 with merge base ac8954d ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

ghstack-source-id: d2b0ba7 Pull Request resolved: #131936

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

ghstack-source-id: 44cba4e Pull Request resolved: #131936

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

ghstack-source-id: 861b528 Pull Request resolved: #131936

This is a utility to aid the torch.compile debugging. You provide a function that returns True on success, False on failure, or do something out of process and run bisect_helper `good | bad`. The bisector will first go through backends - `eager, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it. An example usage of how to hook it up for aot_eager_decomp_partition and decomposition subsystem is : ``` from torch._inductor.bisect_helper import BisectionManager if op in CURRENT_DECOMPOSITION_TABLE: if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)): return NotImplemented ``` Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`. We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts. Fix for #126546 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

ghstack-source-id: 62f3b61 Pull Request resolved: #131936

ezyang · 2024-08-01T02:49:30Z

cc @bhack

ezyang · 2024-08-01T02:50:47Z

Q without having reviewed the implementation: how quickly can I trigger this by changing trainer configuration on a MAST job, e.g., if I have a job that failed and I want to run the bisector directly in the job?

This is a utility to aid the torch.compile debugging. You provide a function that returns True on success, False on failure, or do something out of process and run bisect_helper `good | bad`. The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it. An example usage of how to hook it up for aot_eager_decomp_partition and decomposition subsystem is : ``` from torch._inductor.bisect_helper import BisectionManager if op in CURRENT_DECOMPOSITION_TABLE: if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)): return NotImplemented ``` Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`. We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts. Fix for #126546 Edit: have test but accidentally didn't git add, will add cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

ghstack-source-id: adf81ff Pull Request resolved: #131936

This is a utility to aid the torch.compile debugging. You provide a function that returns True on success, False on failure, or do something out of process and run bisect_helper `good | bad`. The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it. An example usage of how to hook it up for aot_eager_decomp_partition and decomposition subsystem is : ``` from torch._inductor.bisect_helper import BisectionManager if op in CURRENT_DECOMPOSITION_TABLE: if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)): return NotImplemented ``` Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`. We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts. Fix for #126546 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

ghstack-source-id: 6b4c1bb Pull Request resolved: #131936

bhack · 2024-08-01T15:58:02Z

Do you have an end user/developer example on how this is going to be used to report a compiler failure?

eellison · 2024-08-01T16:00:24Z

@bhack check tests

ghstack-source-id: 394623f Pull Request resolved: #131936

This is a utility to aid the torch.compile debugging. You provide a function that returns True on success, False on failure, or do something out of process and run bisect_helper `good | bad`. The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it. An example usage of how to hook it up for aot_eager_decomp_partition and decomposition subsystem is : ``` from torch._inductor.bisect_helper import BisectionManager if op in CURRENT_DECOMPOSITION_TABLE: if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)): return NotImplemented ``` Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`. We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts. Fix for #126546 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

ghstack-source-id: 211f925 Pull Request resolved: #131936

eellison · 2024-09-17T19:27:19Z

@pytorchbot merge

pytorchmergebot · 2024-09-17T19:29:11Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2024-09-18T01:27:45Z

The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command
For more information see pytorch-bot wiki.

eellison · 2024-09-18T04:00:05Z

@pytorchbot merge

pytorchmergebot · 2024-09-18T04:01:59Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2024-09-18T10:00:33Z

The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command
For more information see pytorch-bot wiki.

This is a utility to aid the torch.compile debugging. You provide a function that returns True on success, False on failure, or do something out of process and run bisect_helper `good | bad`. The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it. An example usage of how to hook it up for aot_eager_decomp_partition and decomposition subsystem is : ``` from torch._inductor.bisect_helper import BisectionManager if op in CURRENT_DECOMPOSITION_TABLE: if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)): return NotImplemented ``` Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`. We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts. Fix for #126546 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang rec [ghstack-poisoned]

eellison · 2024-10-07T22:01:58Z

@pytorchbot merge

pytorchmergebot · 2024-10-07T22:03:34Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2024-10-07T22:03:53Z

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 71db3dd9410496ff5fc0428dd631e08498ea461a returned non-zero exit code 1

Auto-merging test/run_test.py
Auto-merging torch/_inductor/codecache.py
Auto-merging torch/_inductor/fx_passes/post_grad.py
CONFLICT (content): Merge conflict in torch/_inductor/fx_passes/post_grad.py
Auto-merging torch/_inductor/graph.py
Auto-merging torch/fx/experimental/proxy_tensor.py
error: could not apply 71db3dd941... Add compiler bisector
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config advice.mergeConflict false"

Details for Dev Infra team

Raised by workflow job

This is a utility to aid the torch.compile debugging. You provide a function that returns True on success, False on failure, or do something out of process and run bisect_helper `good | bad`. The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it. An example usage of how to hook it up for aot_eager_decomp_partition and decomposition subsystem is : ``` from torch._inductor.bisect_helper import BisectionManager if op in CURRENT_DECOMPOSITION_TABLE: if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)): return NotImplemented ``` Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`. We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts. Fix for #126546 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang rec [ghstack-poisoned]

ghstack-source-id: eb9150d Pull Request resolved: #131936

This is a utility to aid the torch.compile debugging. You provide a function that returns True on success, False on failure, or do something out of process and run bisect_helper `good | bad`. The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it. An example usage of how to hook it up for aot_eager_decomp_partition and decomposition subsystem is : ``` from torch._inductor.bisect_helper import BisectionManager if op in CURRENT_DECOMPOSITION_TABLE: if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)): return NotImplemented ``` Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`. We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts. Fix for #126546 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang rec [ghstack-poisoned]

ghstack-source-id: 48a9be1 Pull Request resolved: #131936

eellison · 2024-10-09T18:08:27Z

@pytorchbot merge

pytorchmergebot · 2024-10-09T18:10:21Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

Follow up to #131936. In the original bisector you'd have to test inline if we were disabling a component - `if BisectionManager.disable_subsystem("inductor", "post_grad_passes", debug_info)`. This adds a convenient way of testing config changes for root causing issue. I've added `emulate_precision_casts` and aot_eager_decomp_partition cse as initial ones. Pull Request resolved: #137346 Approved by: https://github.com/zou3519

[WIP] Add compiler bisector

bc56a72

[ghstack-poisoned]

pytorch-bot bot added ciflow/inductor module: dynamo module: inductor release notes: fx release notes category labels Jul 26, 2024

Update on "[WIP] Add compiler bisector"

5cc2513

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

Update on "[WIP] Add compiler bisector"

24beefe

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

eellison added a commit that referenced this pull request Jul 26, 2024

[WIP] Add compiler bisector

685d89a

ghstack-source-id: d2b0ba7 Pull Request resolved: #131936

Update on "[WIP] Add compiler bisector"

835378f

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

eellison added a commit that referenced this pull request Jul 27, 2024

[WIP] Add compiler bisector

61c7aee

ghstack-source-id: 44cba4e Pull Request resolved: #131936

Update on "[WIP] Add compiler bisector"

99cb02d

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

eellison added a commit that referenced this pull request Jul 31, 2024

Add compiler bisector

dfbe1f1

ghstack-source-id: 861b528 Pull Request resolved: #131936

eellison changed the title ~~[WIP] Add compiler bisector~~ Add compiler bisector Jul 31, 2024

eellison requested review from zou3519, bdhirsh and ezyang July 31, 2024 22:24

eellison added a commit that referenced this pull request Jul 31, 2024

Add compiler bisector

7b31f84

ghstack-source-id: 62f3b61 Pull Request resolved: #131936

eellison added a commit that referenced this pull request Aug 1, 2024

Add compiler bisector

65d6d2c

ghstack-source-id: adf81ff Pull Request resolved: #131936

eellison requested a review from a team as a code owner August 1, 2024 15:30

eellison added a commit that referenced this pull request Aug 1, 2024

Add compiler bisector

ced0240

ghstack-source-id: 6b4c1bb Pull Request resolved: #131936

eellison added a commit that referenced this pull request Sep 17, 2024

Add compiler bisector

fec2152

ghstack-source-id: 394623f Pull Request resolved: #131936

eellison added a commit that referenced this pull request Sep 17, 2024

Add compiler bisector

33ef987

ghstack-source-id: 211f925 Pull Request resolved: #131936

pytorchmergebot added the merging label Sep 17, 2024

eellison mentioned this pull request Oct 4, 2024

Search through config changes in compiler bisector #137346

Closed

pytorchmergebot removed the merging label Oct 7, 2024

eellison added a commit that referenced this pull request Oct 9, 2024

Add compiler bisector

96c1afa

ghstack-source-id: eb9150d Pull Request resolved: #131936

eellison added a commit that referenced this pull request Oct 9, 2024

Add compiler bisector

79624f8

ghstack-source-id: 48a9be1 Pull Request resolved: #131936

pytorchmergebot added the merging label Oct 9, 2024

pytorchmergebot added the Merged label Oct 9, 2024

pytorchmergebot closed this in 47af7cc Oct 9, 2024

pytorchmergebot removed the merging label Oct 9, 2024

github-actions bot deleted the gh/eellison/679/head branch November 9, 2024 02:04

Add compiler bisector #131936

Add compiler bisector #131936

Uh oh!

Conversation

eellison commented Jul 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot bot commented Jul 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131936

✅ No Failures

Uh oh!

ezyang commented Aug 1, 2024

Uh oh!

ezyang commented Aug 1, 2024

Uh oh!

bhack commented Aug 1, 2024

Uh oh!

eellison commented Aug 1, 2024

Uh oh!

eellison commented Sep 17, 2024

Uh oh!

pytorchmergebot commented Sep 17, 2024

Merge started

Uh oh!

pytorchmergebot commented Sep 18, 2024

Uh oh!

eellison commented Sep 18, 2024

Uh oh!

pytorchmergebot commented Sep 18, 2024

Merge started

Uh oh!

pytorchmergebot commented Sep 18, 2024

Uh oh!

eellison commented Oct 7, 2024

Uh oh!

pytorchmergebot commented Oct 7, 2024

Merge started

Uh oh!

pytorchmergebot commented Oct 7, 2024

Merge failed

Uh oh!

eellison commented Oct 9, 2024

Uh oh!

pytorchmergebot commented Oct 9, 2024

Merge started

Uh oh!

Uh oh!

eellison commented Jul 26, 2024 •

edited

Loading

pytorch-bot bot commented Jul 26, 2024 •

edited

Loading