CARVIEW |
Navigation Menu
-
Notifications
You must be signed in to change notification settings - Fork 24.7k
codecache: pull out some Graph serialization code into common helpers #141502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141502
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (2 Unrelated Failures)As of commit 530614a with merge base 4959784 ( UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
…mon helpers" Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile. Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]
…mon helpers" Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile. Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]
…mon helpers" Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile. Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]
…mon helpers" Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile. Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]
…mon helpers" Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile. Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]
…mon helpers" Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile. Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]
…mon helpers" Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile. Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]
…mon helpers" Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile. Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]
# models to disk. | ||
self.current_callable = None | ||
|
||
def after_deserialization(self, gm: Optional[torch.fx.GraphModule]) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ezyang , this is basically just pulling out the code you wanted to refactor into post_compile, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also qq (maybe for @aorenste) do we do this serialization even when the outputcode is generated via AOTI (not CompiledFxGraph)? I.e. does it belong in OutputCode, or does it belong in CompiledFxGraph?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why we want this separate from post_compile (there's probably a good reason, I'm just not seeing it). I am not sure why this isn't a shared method in the top level protocol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may not be a good reason. I did all this code before post_compile/OutputCode existed - so it may make sense to merge it all together better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actual purpose of the PR seems good, but we may want to change/bikeshed some names around to figure out the order these functions should be called.
I think it's confusing to both ahve a post_compile
step and a after_deserialization
step, but I also think after_deserialization
is a better name. I guess concretely I think, assuming this after_deserialization always runs on all types of OutputCodes:
- Define
OutputCode.post_serialize
to be thisafter_deserialization
function - Rename
CompiledFxGraph.post_compile
toCompiledFxGraph.post_serialize
, and haveCompiledFxGraph
inherit fromOutputCode
's definition by callingOutputCode
first.
That said, today CompiledFxGraph isn't even an actual child class of CompiledFxGraph. Not sure what's stopping us from making that happen though. Will wait for @ezyang 's thoughts.
except OSError: | ||
# Not expected, but in case the PyCodeCache entry is removed from | ||
# underneath us, treat it as a cache miss and recompile. | ||
log.error("Failed to load cached artifact: %s", artifact_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing the log line seems worse
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The log still happens - just inside after_deserialization
which will raise an exception (because it doesn't know if ignoring the exception is the right thing or not).
@@ -1223,25 +1204,17 @@ def iterate_over_candidates() -> Generator[CompiledFxGraph, None, None]: | |||
if len(meta.cached_kernel_names) > 0: | |||
get_metrics_context().increment("num_triton_bundles", 1) | |||
|
|||
inductor_meta = autotune_cache.inductor_meta_from_config() | |||
AutotuneCacheBundler.begin_compile(inductor_meta, code=code) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving this after the OSError catch looks like a bug fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not really. The problem is that in order to look up the autotune cache we need the artifact code - but we don't have the artifact code until after after_deserialization() is called. AutotuneCacheBundler.begin_compile() just needs to get called sometime after you have the artifact code and before the first autotune occurs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bug fix is having begun the compile and then immediately erroring out (without ending compile)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you're saying. Yeah - the AutotuneCacheBundler is a weird thing because our compile doesn't really have a good place to "wrap" (it would have to be somewhere in the dynamo call
bytecode handler). begin_compile
really just means "go fetch the compiled artifacts and spit them out on disk as local artifacts". end_compile
really just means "collect all the local artifacts and bundle them into a single remote artifact". So not having an end doesn't really mean anything bad - it's supposed to handle that case properly (since it can happen in lots of ways).
# so we serialize their PyCodeCache disk cache location instead. | ||
# TODO: This could be better if we're ever able to serialize compiled | ||
# models to disk. | ||
self.current_callable = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is OK for refactoring but I don't want a "mutate the object so I can serialize it" API to be the final way of doing this. Definitely want a better API.
from .graph import GraphLowering | ||
|
||
# This is used by tests to check the output for specific details. | ||
GraphLowering.save_output_code(code) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was moved here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like forward progress, probably need to work on this code more though
- Turn fx_codegen_and_compile() into a class (FxCompile) so we can override the implementation. - Pull the current body into an implementation (_InProcessFxCompile) which just performs the existing behavior. - Add an async interface. (See below) The intended future behavior of Async Compile will be to allow dynamo functions to start compiling in the background (and on a separate machine) while we continue to run eager in the foreground. As such we'll need to put the compilation behind some kind of Future implementation - it makes sense to reuse the existing python futures for that. An async function is just a syntactic way to return an asyncio.Future. Because asyncio.run() adds confusion to the stack traces when the called function isn't actually being used in an asynchronous way we also provide a synchronous interface which can be directly called. Pull Request resolved: #141505 Approved by: https://github.com/ezyang ghstack dependencies: #141502
…pytorch#141502) Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile. Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache. Pull Request resolved: pytorch#141502 Approved by: https://github.com/ezyang
- Turn fx_codegen_and_compile() into a class (FxCompile) so we can override the implementation. - Pull the current body into an implementation (_InProcessFxCompile) which just performs the existing behavior. - Add an async interface. (See below) The intended future behavior of Async Compile will be to allow dynamo functions to start compiling in the background (and on a separate machine) while we continue to run eager in the foreground. As such we'll need to put the compilation behind some kind of Future implementation - it makes sense to reuse the existing python futures for that. An async function is just a syntactic way to return an asyncio.Future. Because asyncio.run() adds confusion to the stack traces when the called function isn't actually being used in an asynchronous way we also provide a synchronous interface which can be directly called. Pull Request resolved: pytorch#141505 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#141502
…pytorch#141502) Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile. Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache. Pull Request resolved: pytorch#141502 Approved by: https://github.com/ezyang
- Turn fx_codegen_and_compile() into a class (FxCompile) so we can override the implementation. - Pull the current body into an implementation (_InProcessFxCompile) which just performs the existing behavior. - Add an async interface. (See below) The intended future behavior of Async Compile will be to allow dynamo functions to start compiling in the background (and on a separate machine) while we continue to run eager in the foreground. As such we'll need to put the compilation behind some kind of Future implementation - it makes sense to reuse the existing python futures for that. An async function is just a syntactic way to return an asyncio.Future. Because asyncio.run() adds confusion to the stack traces when the called function isn't actually being used in an asynchronous way we also provide a synchronous interface which can be directly called. Pull Request resolved: pytorch#141505 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#141502
Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile.
Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache.
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov