[distributed] add PG APIs and general doc cleanups #140853
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140853

Note: links to docs will display an error until the doc builds have completed.

❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.

✅ No Failures as of commit 8662b64 with merge base 48a276c.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
docs/source/distributed.rst (Outdated)

> There are some notable differences:
>
> * These APIs are always asynchronous and require you to manually synchronize the returned :class:`~torch.distributed.Work` objects.
Can't users still manually create a PG and then call `dist.all_reduce(group=pg)` and get the same thing, but without skipping our validation layers that are only present in the upper layer? I'm still wondering in what cases it would be good to call `pg.all_reduce` instead. (Granted, my other PRs for adding `group_src`/`group_dst` are needed for this statement to hold.)

Also, I haven't looked at it carefully, but I am worried about manual PG creation skipping our UUID logic. I think that ends up not mattering if you use a prefix store for each PG, but if you use the same prefix store for 2 PGs by accident you'd be in trouble, right? Should we actually encourage users to use the ProcessGroup ctor, or should we improve our `new_group`-type APIs to let the PGs be created without a harmful tie to the world?
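For concreteness, here is a minimal sketch of the two call styles this question contrasts. It assumes an already-initialized gloo world (e.g. launched via torchrun); the tensor and group here are illustrative, not taken from this PR:

```python
import torch
import torch.distributed as dist

dist.init_process_group("gloo")  # assumes env:// rendezvous vars are set
pg = dist.new_group(ranks=list(range(dist.get_world_size())))

t = torch.ones(4)

# Object-oriented PG API: always asynchronous, returns a c10d Work object
# that the caller must synchronize manually.
work = pg.allreduce([t])  # the binding takes a list of tensors
work.wait()

# Functional API: goes through the validation layers in the upper
# (distributed_c10d) layer and blocks by default.
dist.all_reduce(t, group=pg)
```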
Yeah, I talked with vllm in one of their PRs. A potential thing we can do is to improve the `new_group` API so that it does not require `init_process_group` (world creation) first.

vllm-project/vllm#10216 (review)
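A quick sketch of the constraint this comment proposes relaxing, assuming today's behavior where `new_group` requires an existing world:

```python
import torch.distributed as dist

# Today, the world must exist before any subgroup can be carved out of it:
dist.init_process_group("gloo")        # world creation comes first...
sub_pg = dist.new_group(ranks=[0, 1])  # ...then subgroups tied to that world

# The proposal is to let new_group-style APIs build a PG without requiring
# init_process_group, i.e. without the tie to the world.
```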
agreed, removed this documentation in favor of your other PRs
```diff
@@ -2845,7 +2964,8 @@ options :class:`~torch.distributed.ProcessGroupNCCL.Options`).
       .def(
           "abort",
           &::c10d::ProcessGroupNCCL::abort,
-          py::call_guard<py::gil_scoped_release>())
+          py::call_guard<py::gil_scoped_release>(),
+          R"(Abort the process group.)")
```
note to check: do we have proper docs for this API in our c10d layer? IIRC it is a new/experimental API
I don't think we have any docs on abort
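For reference, a hedged sketch of how the `abort` binding touched by this diff might be reached from Python today. `_get_default_group` and `_get_backend` are private accessors, so treat this as an assumption about current internals rather than a documented recipe:

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")

# Private accessors: fetch the ProcessGroupNCCL backend behind the
# default group (an assumption about current internals, not public API).
pg = dist.distributed_c10d._get_default_group()
backend = pg._get_backend(torch.device("cuda"))

# The .def("abort", ...) above is what gives this call its docstring;
# it aborts the group's NCCL communicators.
backend.abort()
```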
I was going to stamp anyway, despite my question above, since the overall changes are good. But there is some problem: it looks like the doc build is failing. I noticed when I tried to click on the py docs build link and it didn't open.
Overall looks good.
I left a couple comments inline. Would appreciate it if we could "de-emphasize" the support before merge.
docs/source/distributed.rst (Outdated)

> APIs but if you need more control over the process groups (i.e. dynamic world sizes) you can directly instantiate them.
Sorry, what does the "dynamic world size" feature refer to?

I am a little bit unsure about suggesting direct instantiation of `ProcessGroupNCCL` or `ProcessGroupGloo` in our documentation. The main reason is that it may go against device generalization. As in, if other device backends follow, they may also request a pybind of their backend classes to `dist` and an exposure in the doc.

For power users, such usage would be per their choice (after digging into our code), and that's okay.
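To make the concern concrete, a minimal sketch of the direct instantiation the doc text referred to. The constructor shapes are assumptions based on the c10d bindings, and the rendezvous details (host, port, env vars) are illustrative:

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Assumption: launched with torchrun or similar so RANK/WORLD_SIZE are set.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Build a store and hand it straight to a backend class, bypassing
# init_process_group (and, as noted above, the UUID logic) entirely.
store = dist.TCPStore("localhost", 29500, world_size, is_master=(rank == 0))
pg = dist.ProcessGroupGloo(store, rank, world_size, timedelta(seconds=60))
```

Note that this is exactly where the prefix-store footgun raised earlier appears: reusing one store for two PGs without a `PrefixStore` wrapper would let their keys collide.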
removed
docs/source/distributed.rst (Outdated)

> Object Oriented Process Groups
> ------------------------------
The title sounds like we support calling member methods on these ProcessGroup objects? I am not sure we have signed up to support that?
yeah, upon further discussion I think it's best not to recommend that users use this
Force-pushed from 025e12a to 8662b64 (Compare)
I updated this to remove documenting the PG API and just keep the doc cleanups. After discussion I think it's better to improve `new_group` instead.
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Doc updates:

* This adds documentation for the object oriented ProcessGroup APIs that are being used in torchft as well as pytorch/rfcs#71.
* It also does some general cleanups to simplify distributed.rst by using `:methods`.
* It adds `__init__` definitions for the Stores.
* I've reordered things so the collective APIs are before the Store/PG APIs.

Test plan:

```
lintrunner -a
cd docs && sphinx-autobuild source build/ -j auto -WT --keep-going
```

Pull Request resolved: pytorch#140853
Approved by: https://github.com/kwen2501
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @c-p-i-o