[inductor][memory] restructuring memory.py and turn on the flag #137205
Conversation
Helpful Links: See artifacts and rendered test results at hud.pytorch.org/pr/137205
Note: Links to docs will display an error until the docs builds have been completed. No failures as of commit 3cba60a with merge base 10a34dc. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
@@ -28,48 +29,35 @@ class MemoryPlanningInfoForBuffer:
    succ_nodes: OrderedSet[BaseSchedulerNode] = dataclasses.field(
        default_factory=OrderedSet
    )
    outdegree: int = 0  # this is used only in topological_sort_lpmf
In `MemoryPlanningInfoForBuffer`, I am only keeping the attributes that are static. For dynamically set attributes (e.g., `outdegree`), I will keep track of them inside the function that needs them. Similarly, I removed, e.g., `memory_to_free` and `indegree` from `MemoryPlanningInfoForNode`.
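To illustrate the split, here is a minimal sketch only; the class body and the `topological_sort_lpmf` stub below are simplified stand-ins for the real code in `torch/_inductor/memory.py`, not its actual implementation. Static dependency information stays on the dataclass, while per-run counters live in a local dict inside the one function that needs them.

```python
import dataclasses
from typing import Dict, List, Set


class BaseSchedulerNode:  # toy stand-in for the real scheduler node type
    pass


@dataclasses.dataclass
class MemoryPlanningInfoForBuffer:
    # only static information is stored on the buffer itself
    succ_nodes: Set[BaseSchedulerNode] = dataclasses.field(default_factory=set)


def topological_sort_lpmf(bufs: List[MemoryPlanningInfoForBuffer]) -> None:
    # dynamic, per-run state (previously an `outdegree` field on the dataclass)
    # now lives in a local dict inside the only function that uses it
    outdegree: Dict[int, int] = {id(buf): len(buf.succ_nodes) for buf in bufs}
    # ... the LPMF ordering would decrement these counters as successor nodes
    # are scheduled and treat a buffer as freeable once its count hits zero ...
```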
Looks great! I had one concern about a potential O(n^2) blowup, but it turns out that is not a real issue (commenting on the other PR for continuity). Can we please submit a dashboard run first? If we are at all worried about the riskiness of the change, it can also make sense to add a JK for enablement, or enable in OSS first and then in fbcode.
@@ -259,7 +259,7 @@ def autotune_remote_cache_default() -> Optional[bool]:
]

# enable operator reordering for peak memory optimization
reorder_for_peak_memory = os.environ.get("TORCHINDUCTOR_REORDER_FOR_PEAK_MEMORY") == "1"
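For anyone trying this out, here is a sketch of how the flag could be toggled based on the config line above. It assumes the environment variable is read when `torch._inductor.config` is first imported and that patching the config attribute directly has the same effect.

```python
import os

# Option 1 (assumed): opt in via the environment variable from the diff above;
# set it before torch._inductor.config is first imported.
os.environ["TORCHINDUCTOR_REORDER_FOR_PEAK_MEMORY"] = "1"

import torch
import torch._inductor.config as inductor_config

# Option 2 (assumed): flip the config knob directly for the current process.
inductor_config.reorder_for_peak_memory = True


@torch.compile
def f(x):
    return (x @ x).relu().sum()


print(f(torch.randn(64, 64)))
```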
Did we do a dashboard run yet? Could we do that before landing?
It's easiest to do via ghstack, but I think you can also push to origin/main or something, I forget exactly; I would recommend ghstack. If you've already submitted this PR, the easiest way to do that is just `git reset --soft HEAD~1; git commit -m "tmp"; ghstack` and benchmark this commit as another PR.
Let me try to figure this out.
Sorry, one other thing: we should test out the various implementations on CI runs as well, if possible, unless you already have a clear idea that one is Pareto-optimal (compilation time, average memory change, worst memory change).
> if we are at all worried about the riskiness of the change, it can also make sense to add a JK for enablement, or enable in OSS first and then in fbcode

I was initially very concerned about this, but as I try this flag on more internal models, I am becoming less concerned. There may be situations I am not aware of, though, so I think it would be best to leave this decision to the PyTorch team. You guys are the experts here :) One possibility is that once we turn it on, I can monitor it carefully for a while?
Seems all the dashboard runs are done. This is the screenshot of the summary (seems nothing is significant).

Individual breakdowns can be found here:
- gh/xuanzhang816/1/head -- using all three heuristics
- gh/xuanzhang816/2/head -- lpmf only
- gh/xuanzhang816/3/head -- bfs only
- gh/xuanzhang816/4/head -- dfs only
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot drci
Commits updated from cd8531c to d046fea.
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first few of them are: inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_distributed, 1, 1, linux.g5.12xlarge.nvidia.gpu). Details for Dev Infra team: raised by workflow job.
@pytorchmergebot merge -f "remaining tests queued for too long"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Addressing additional comments given in PR #134874
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @rec