[distributed] add PG APIs and general doc cleanups #140853
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140853

Note: links to docs will display an error until the doc builds have completed.

❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.

✅ No Failures as of commit 8662b64 with merge base 48a276c.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
docs/source/distributed.rst (Outdated)

> There are some notable differences:
>
> * These APIs are always asynchronous and require you to manually synchronize the returned :class:`~torch.distributed.Work` objects.
Can't users still manually create a PG and then call `dist.all_reduce(group=pg)` and get the same thing, but without skipping our validation layers that are only present in the upper layer? I'm still wondering in what cases it would be good to call `pg.all_reduce` instead. (Granted, my other PRs for adding `group_src`/`group_dst` are needed for this statement to hold.)

Also, I haven't looked at it carefully, but I am worried about manual PG creation skipping our UUID logic. I think that ends up not mattering if you use a prefix store for each PG, but if you use the same prefix store for 2 PGs by accident you'd be in trouble, right? Should we actually encourage users to use the ProcessGroup ctor, or should we improve our `new_group`-type APIs to let the PGs be created without a harmful tie to the world?
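For concreteness, here is a minimal sketch of the two call styles this question contrasts. It assumes an already-initialized gloo world (e.g. launched via torchrun); the tensor and group here are illustrative, not taken from this PR:

```python
import torch
import torch.distributed as dist

dist.init_process_group("gloo")  # assumes env:// rendezvous vars are set
pg = dist.new_group(ranks=list(range(dist.get_world_size())))

t = torch.ones(4)

# Object-oriented PG API: always asynchronous, returns a c10d Work object
# that the caller must synchronize manually.
work = pg.allreduce([t])  # the binding takes a list of tensors
work.wait()

# Functional API: goes through the validation layers in the upper
# (distributed_c10d) layer and blocks by default.
dist.all_reduce(t, group=pg)
```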
Yeah, I talked with vllm in one of their PRs. A potential thing we can do is to improve the `new_group` API so that it does not require `init_process_group` (world creation) first.

vllm-project/vllm#10216 (review)
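A quick sketch of the constraint this comment proposes relaxing, assuming today's behavior where `new_group` requires an existing world:

```python
import torch.distributed as dist

# Today, the world must exist before any subgroup can be carved out of it:
dist.init_process_group("gloo")        # world creation comes first...
sub_pg = dist.new_group(ranks=[0, 1])  # ...then subgroups tied to that world

# The proposal is to let new_group-style APIs build a PG without requiring
# init_process_group, i.e. without the tie to the world.
```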
agreed, removed this documentation in favor of your other PRs
```diff
@@ -2845,7 +2964,8 @@ options :class:`~torch.distributed.ProcessGroupNCCL.Options`).
       .def(
           "abort",
           &::c10d::ProcessGroupNCCL::abort,
-          py::call_guard<py::gil_scoped_release>())
+          py::call_guard<py::gil_scoped_release>(),
+          R"(Abort the process group.)")
```
note to check: do we have proper docs for this API in our c10d layer? IIRC it is a new/experimental API
I don't think we have any docs on abort
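For reference, a hedged sketch of how the `abort` binding touched by this diff might be reached from Python today. `_get_default_group` and `_get_backend` are private accessors, so treat this as an assumption about current internals rather than a documented recipe:

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")

# Private accessors: fetch the ProcessGroupNCCL backend behind the
# default group (an assumption about current internals, not public API).
pg = dist.distributed_c10d._get_default_group()
backend = pg._get_backend(torch.device("cuda"))

# The .def("abort", ...) above is what gives this call its docstring;
# it aborts the group's NCCL communicators.
backend.abort()
```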
I was going to stamp anyway, despite my question above, since the overall changes are good. But there is some problem: it looks like the doc build is failing. I noticed when I tried to click on the py docs build link and it didn't open.
Overall looks good.
I left a couple comments inline. Would appreciate it if we could "de-emphasize" the support before merge.
docs/source/distributed.rst (Outdated)

> APIs but if you need more control over the process groups (i.e. dynamic world sizes) you can directly instantiate them.
Sorry, what does the "dynamic world size" feature refer to?

I am a little bit unsure about suggesting direct instantiation of `ProcessGroupNCCL` or `ProcessGroupGloo` in our documentation. The main reason is that it may go against device generalization. As in, if other device backends follow, they may also request a pybind of their backend classes to `dist` and an exposure in the doc.

For power users, such usage would be per their choice (after digging into our code), and that's okay.
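To make the concern concrete, a minimal sketch of the direct instantiation the doc text referred to. The constructor shapes are assumptions based on the c10d bindings, and the rendezvous details (host, port, env vars) are illustrative:

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Assumption: launched with torchrun or similar so RANK/WORLD_SIZE are set.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Build a store and hand it straight to a backend class, bypassing
# init_process_group (and, as noted above, the UUID logic) entirely.
store = dist.TCPStore("localhost", 29500, world_size, is_master=(rank == 0))
pg = dist.ProcessGroupGloo(store, rank, world_size, timedelta(seconds=60))
```

Note that this is exactly where the prefix-store footgun raised earlier appears: reusing one store for two PGs without a `PrefixStore` wrapper would let their keys collide.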
removed
docs/source/distributed.rst (Outdated)

> Object Oriented Process Groups
> ------------------------------
The title sounds like we support calling member methods on these ProcessGroup objects? I am not sure we have signed up to support that?
yeah, upon further discussion I think it's best not to recommend that users use this
Force-pushed from 025e12a to 8662b64 (Compare)
I updated this to remove documenting the PG API and just keep the doc cleanups. After discussion I think it's better to improve `new_group` instead.
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Doc updates:

* This adds documentation for the object oriented ProcessGroup APIs that are being used in torchft as well as pytorch/rfcs#71.
* It also does some general cleanups to simplify distributed.rst by using `:methods`.
* It adds `__init__` definitions for the Stores.
* I've reordered things so the collective APIs are before the Store/PG APIs.

Test plan:

```
lintrunner -a
cd docs && sphinx-autobuild source build/ -j auto -WT --keep-going
```

Pull Request resolved: pytorch#140853
Approved by: https://github.com/kwen2501
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @c-p-i-o