Pipeline and Virtual-Pipeline Parallelism Using CUDA Graph and Integrate CUDAMallocAsyncAllocator #60516

Conversation
Your PR has been submitted successfully. Thank you for your contribution to this open-source project!
<< " for StreamSafeCUDAAllocator(" << allocator.get() << ") in " | ||
<< place; | ||
if (auto allocator = std::dynamic_pointer_cast<StreamSafeCUDAAllocator>( | ||
GetDefaultStreamSafeCUDAAllocator(place))) { |
We have three design choices here:
- Inheritance from Allocator: Currently, both StreamSafeCUDAAllocator and CUDAMallocAsyncAllocator inherit directly from the Allocator class, but only these two possess stream-related methods; other allocator types do not.
  - Pros: This approach aligns well with the conceptual model in which CUDAMallocAsyncAllocator is an alternative to StreamSafeCUDAAllocator.
  - Cons: We must upcast to Allocator and then downcast back to CUDAMallocAsyncAllocator or StreamSafeCUDAAllocator (see the sketch after this list).
- Centralizing stream-related methods: Another approach is to move the stream-related methods into the base Allocator class; allocators that do not support them would raise a runtime error.
  - Cons: It complicates the base class with methods that are irrelevant to some of its subclasses, violating the principle of interface segregation.
- Inheriting CUDAMallocAsyncAllocator from StreamSafeCUDAAllocator: The third option makes CUDAMallocAsyncAllocator inherit from StreamSafeCUDAAllocator, implying a direct relationship in which one is a more specific version of the other.
  - Cons: This design is conceptually awkward, as it positions CUDAMallocAsyncAllocator as a subtype of StreamSafeCUDAAllocator even though it is intended as a replacement. It suggests a false hierarchy, inviting confusion and misused inheritance.
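To make the trade-off in the first choice concrete, here is a minimal, self-contained sketch of the upcast/downcast pattern it forces on call sites. Class and method names are simplified stand-ins for illustration, not Paddle's actual interfaces:

```cpp
#include <iostream>
#include <memory>

struct Allocator {
  virtual ~Allocator() = default;  // Polymorphic base, no stream methods.
};

struct StreamSafeCUDAAllocator : Allocator {
  void RecordStream() { std::cout << "StreamSafe::RecordStream\n"; }
};

struct CUDAMallocAsyncAllocator : Allocator {
  void RecordStream() { std::cout << "MallocAsync::RecordStream\n"; }
};

void RecordOnStream(const std::shared_ptr<Allocator>& allocator) {
  // The downcast chain the "Cons" bullet refers to: try each stream-aware
  // subtype in turn, since the base class exposes no stream-related methods.
  if (auto a = std::dynamic_pointer_cast<StreamSafeCUDAAllocator>(allocator)) {
    a->RecordStream();
  } else if (auto a = std::dynamic_pointer_cast<CUDAMallocAsyncAllocator>(
                 allocator)) {
    a->RecordStream();
  }  // Other allocator types simply have no stream semantics.
}

int main() { RecordOnStream(std::make_shared<CUDAMallocAsyncAllocator>()); }
```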
@@ -38,10 +38,10 @@ inline bool IsCUDAGraphCapturing() {
 // Add reset callback if CUDA Graph is capturing.
 // Otherwise, invoke callback directly.
 template <typename Callback>
-inline void AddResetCallbackIfCapturingCUDAGraph(Callback &&callback) {
+inline void AddPostResetCallbackIfCapturingCUDAGraph(Callback &&callback) {
This API has been renamed for clarity.
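For context, a hedged sketch of what such a helper might look like, consistent with the comments in the diff hunk above. The capture check and the callback registry are stand-in definitions assumed here for illustration, not Paddle's actual ones:

```cpp
#include <functional>
#include <utility>
#include <vector>

// Assumed stand-ins for the real capture state and post-reset registry.
inline bool IsCUDAGraphCapturing() { return false; }
inline std::vector<std::function<void()>> cudagraph_post_reset_callbacks_;

// Add a post-reset callback if CUDA Graph is capturing.
// Otherwise, invoke the callback directly.
template <typename Callback>
inline void AddPostResetCallbackIfCapturingCUDAGraph(Callback &&callback) {
  if (IsCUDAGraphCapturing()) {
    // Deferred: run only after the captured graph is reset.
    cudagraph_post_reset_callbacks_.emplace_back(
        std::forward<Callback>(callback));
  } else {
    callback();
  }
}
```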
@@ -800,6 +800,7 @@ def _backward_step(self, input_tensor, output_tensor, output_tensor_grad):
                 [t.grad for t in input_tensor if not t.stop_gradient]
             )
         else:
+            assert input_tensor.grad is not None
If input_tensor.grad is None, it may cause the pipeline to hang.
LGTM
LGTM. Could the print be fixed in a follow-up PR?
2024-01-09 18:03:11 0. You must have one RD (lanxianghit (Recommend), phlrain or luotao1 or Aurelius84) approval for changing the FLAGS, which manage the environment variables.
OK
Sorry to inform you that 5674da9's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
bool is_reset_{false};
std::mutex mtx_;

std::vector<SetSeedFunc> set_seed_funcs_;

// Callbacks to run after the captured CUDA graph is reset; registered via
// AddPostResetCallbackIfCapturingCUDAGraph.
std::vector<std::function<void()>> cudagraph_post_reset_callbacks_;
Add some comments?
Fixed
Please add more unit tests to check and protect this code.
LGTM
LGTM for _C_ops
LGTM for flags
LGTM
      place_(place),
      default_stream_(default_stream) {
  PADDLE_ENFORCE_GPU_SUCCESS(
      cudaStreamCreateWithPriority(&memory_stream_, cudaStreamNonBlocking, 0));
Where is memory_stream_ used?
Currently, memory_stream_ serves no immediate function and is retained for potential future use (see code here). The original intent behind its design was to simplify memory management by using a single stream specialized for both malloc and free operations. This approach aims to eliminate the complex host-side blocking mechanisms, based on CUDA events, that are used in StreamSafeCUDAAllocator.
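A hedged sketch of that single-memory-stream idea, under the stated assumptions: all allocation and deallocation are routed through one dedicated stream, and compute streams are ordered against it with device-side stream waits instead of host-side blocking. Names and error handling are illustrative, not the allocator's actual implementation:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

struct MemoryStreamAllocator {
  cudaStream_t memory_stream_;

  MemoryStreamAllocator() {
    cudaStreamCreateWithPriority(&memory_stream_, cudaStreamNonBlocking, 0);
  }

  void* Allocate(size_t size, cudaStream_t compute_stream) {
    void* ptr = nullptr;
    cudaMallocAsync(&ptr, size, memory_stream_);
    // Order the compute stream after the allocation, entirely on-device.
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    cudaEventRecord(ev, memory_stream_);
    cudaStreamWaitEvent(compute_stream, ev, 0);
    cudaEventDestroy(ev);  // Safe: destruction is deferred until completion.
    return ptr;
  }

  void Free(void* ptr, cudaStream_t compute_stream) {
    // Order the free after the compute stream's prior work, then hand the
    // block back to CUDA on the dedicated memory stream.
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    cudaEventRecord(ev, compute_stream);
    cudaStreamWaitEvent(memory_stream_, ev, 0);
    cudaEventDestroy(ev);
    cudaFreeAsync(ptr, memory_stream_);
  }
};
```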
std::map<gpuStream_t, gpuEvent_t> event_map_;
};

// The `CUDAMallocAsyncAllocator` class extends `Allocator` and is specialized
When introducing CUDAMallocAsyncAllocator with stream-ordered semantics, why do we still need a complex CUDA-event-related mechanism similar to StreamSafeCUDAAllocator's, involving EventRecord, EventQuery, and unfreed-Allocation management? Since in Paddle the Allocator's malloc and free are dispatched to the same stream as the relevant OP kernel, and cross-stream synchronization for kernels is guaranteed by upstream code, can we transfer all management responsibilities entirely to CUDA and just make simple cudaMallocAsync/cudaFreeAsync calls in the Allocator?
A mechanism that fully offloads memory management responsibilities to CUDA is in development, using memory streams for this specific aim; it remains under construction and may see significant advancements in forthcoming pull requests (note here). At present, APIs akin to EventRecord and EventQuery, along with the management of unfreed allocations, are still required, primarily because the stream passed into the Allocator is the default stream, while RecordStream is employed to annotate the specific stream that a memory block actually operates on; this semantic is not fully compatible with a stream-ordered allocator. Our ongoing efforts focus on addressing these issues, with a commitment to refining this aspect in future updates.
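To illustrate the bookkeeping this reply describes, here is a minimal sketch, assuming simplified names rather than the allocator's actual code: because a block is handed out on the default stream but may be used on others via RecordStream, a free must snapshot outstanding work on every recorded stream with events, and the block stays on an "unfreed" list until an EventQuery-style poll sees them all complete:

```cpp
#include <cuda_runtime.h>
#include <set>
#include <vector>

struct AsyncAllocation {
  void* ptr = nullptr;
  cudaStream_t owning_stream{};          // Stream the block was allocated on.
  std::set<cudaStream_t> used_streams_;  // Streams added via RecordStream.
  std::vector<cudaEvent_t> pending_events_;

  void RecordStream(cudaStream_t s) { used_streams_.insert(s); }

  // Called when the framework frees the block: record one event per stream
  // that touched it, capturing all work enqueued so far.
  void MarkFree() {
    for (cudaStream_t s : used_streams_) {
      cudaEvent_t ev;
      cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
      cudaEventRecord(ev, s);
      pending_events_.push_back(ev);
    }
  }

  // Polled later: release only once every recorded event has completed;
  // until then the allocation remains on the allocator's unfreed list.
  bool TryRelease() {
    for (cudaEvent_t ev : pending_events_) {
      if (cudaEventQuery(ev) == cudaErrorNotReady) return false;
    }
    for (cudaEvent_t ev : pending_events_) cudaEventDestroy(ev);
    cudaFreeAsync(ptr, owning_stream);
    return true;
  }
};
```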
LGTM
PR types
New features

PR changes
Others

Description
This PR introduces Pipeline Parallelism (PP) and Virtual-Pipeline Parallelism (VP) training through the integration of CUDA Graph. The following is a detailed breakdown of the challenges encountered and the solutions we have implemented:

PP/VP + CUDA Graph
Usage: Enable CUDA Graph in PipelineLayer using the use_cudagraph=true flag.

CUDAMallocAsyncAllocator
Usage: Activate the CUDAMallocAsyncAllocator by setting FLAG_use_cuda_malloc_async_allocator=1.

CUDAMallocAsyncAllocator
The CUDAMallocAsyncAllocator replaces the StreamSafeCUDAAllocator. By leveraging the capabilities of cudaMallocAsync and cudaFreeAsync, the responsibility for stream-ordered memory management is transferred from the framework to CUDA. This transition may lead to better memory utilization and potentially improved application performance by optimizing how memory is allocated and deallocated within CUDA.

CUDAMallocAsyncAllocator + CUDAGraph