Pipeline and Virtual-Pipeline Parallelism Using CUDA Graph and Integrate CUDAMallocAsyncAllocator #60516

Conversation
Your PR has been submitted successfully. Thank you for your contribution to this open-source project!
<< " for StreamSafeCUDAAllocator(" << allocator.get() << ") in " | ||
<< place; | ||
if (auto allocator = std::dynamic_pointer_cast<StreamSafeCUDAAllocator>( | ||
GetDefaultStreamSafeCUDAAllocator(place))) { |
We have three design choices here:
- Inheritance from Allocator: Currently, both StreamSafeCUDAAllocator and CUDAMallocAsyncAllocator inherit directly from the Allocator class, but only these two possess stream-related methods; other allocator types do not.
  - Pros: This approach aligns well with the conceptual model in which CUDAMallocAsyncAllocator is an alternative to StreamSafeCUDAAllocator.
  - Cons: We must upcast to Allocator and then downcast back to CUDAMallocAsyncAllocator or StreamSafeCUDAAllocator (see the sketch after this list).
- Centralizing stream-related methods: Another approach is to move the stream-related methods into the base Allocator class; allocators that do not support them would raise a runtime error.
  - Cons: It complicates the base class with methods that are irrelevant to some of its subclasses, violating the principle of interface segregation.
- Inheriting CUDAMallocAsyncAllocator from StreamSafeCUDAAllocator: The third option makes CUDAMallocAsyncAllocator inherit from StreamSafeCUDAAllocator, implying a direct relationship in which one is a more specific version of the other.
  - Cons: This design is conceptually awkward, as it positions CUDAMallocAsyncAllocator as a subtype of StreamSafeCUDAAllocator even though it is intended as a replacement. It suggests a false hierarchy, inviting confusion and misused inheritance.
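To make the trade-off in the first choice concrete, here is a minimal, self-contained sketch of the upcast/downcast pattern it forces on call sites. Class and method names are simplified stand-ins for illustration, not Paddle's actual interfaces:

```cpp
#include <iostream>
#include <memory>

struct Allocator {
  virtual ~Allocator() = default;  // Polymorphic base, no stream methods.
};

struct StreamSafeCUDAAllocator : Allocator {
  void RecordStream() { std::cout << "StreamSafe::RecordStream\n"; }
};

struct CUDAMallocAsyncAllocator : Allocator {
  void RecordStream() { std::cout << "MallocAsync::RecordStream\n"; }
};

void RecordOnStream(const std::shared_ptr<Allocator>& allocator) {
  // The downcast chain the "Cons" bullet refers to: try each stream-aware
  // subtype in turn, since the base class exposes no stream-related methods.
  if (auto a = std::dynamic_pointer_cast<StreamSafeCUDAAllocator>(allocator)) {
    a->RecordStream();
  } else if (auto a = std::dynamic_pointer_cast<CUDAMallocAsyncAllocator>(
                 allocator)) {
    a->RecordStream();
  }  // Other allocator types simply have no stream semantics.
}

int main() { RecordOnStream(std::make_shared<CUDAMallocAsyncAllocator>()); }
```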
@@ -38,10 +38,10 @@ inline bool IsCUDAGraphCapturing() {
 // Add reset callback if CUDA Graph is capturing.
 // Otherwise, invoke callback directly.
 template <typename Callback>
-inline void AddResetCallbackIfCapturingCUDAGraph(Callback &&callback) {
+inline void AddPostResetCallbackIfCapturingCUDAGraph(Callback &&callback) {
This API has been renamed for clarity.
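For context, a hedged sketch of what such a helper might look like, consistent with the comments in the diff hunk above. The capture check and the callback registry are stand-in definitions assumed here for illustration, not Paddle's actual ones:

```cpp
#include <functional>
#include <utility>
#include <vector>

// Assumed stand-ins for the real capture state and post-reset registry.
inline bool IsCUDAGraphCapturing() { return false; }
inline std::vector<std::function<void()>> cudagraph_post_reset_callbacks_;

// Add a post-reset callback if CUDA Graph is capturing.
// Otherwise, invoke the callback directly.
template <typename Callback>
inline void AddPostResetCallbackIfCapturingCUDAGraph(Callback &&callback) {
  if (IsCUDAGraphCapturing()) {
    // Deferred: run only after the captured graph is reset.
    cudagraph_post_reset_callbacks_.emplace_back(
        std::forward<Callback>(callback));
  } else {
    callback();
  }
}
```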
@@ -800,6 +800,7 @@ def _backward_step(self, input_tensor, output_tensor, output_tensor_grad):
                 [t.grad for t in input_tensor if not t.stop_gradient]
             )
         else:
+            assert input_tensor.grad is not None
If input_tensor.grad is None, it may cause the pipeline to hang.
LGTM
LGTM. Could the print be fixed in a follow-up PR?
2024-01-09 18:03:11 0. You must have one RD (lanxianghit (Recommend), phlrain or luotao1 or Aurelius84) approval for changing the FLAGS, which manage the environment variables.
OK
Sorry to inform you that 5674da9's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
bool is_reset_{false};
std::mutex mtx_;

std::vector<SetSeedFunc> set_seed_funcs_;

// Callbacks to run after the captured CUDA graph is reset; registered via
// AddPostResetCallbackIfCapturingCUDAGraph.
std::vector<std::function<void()>> cudagraph_post_reset_callbacks_;
Add some comments?
Fixed
Please add more unit tests to check and protect this code.
LGTM
LGTM for _C_ops
LGTM for flags
LGTM
      place_(place),
      default_stream_(default_stream) {
  PADDLE_ENFORCE_GPU_SUCCESS(
      cudaStreamCreateWithPriority(&memory_stream_, cudaStreamNonBlocking, 0));
Where is memory_stream_ used?
Currently, memory_stream_ serves no immediate function and is retained for potential future use (see code here). The original intent behind its design was to simplify memory management by using a single stream specialized for both malloc and free operations. This approach aims to eliminate the complex host-side blocking mechanisms, based on CUDA events, that are used in StreamSafeCUDAAllocator.
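A hedged sketch of that single-memory-stream idea, under the stated assumptions: all allocation and deallocation are routed through one dedicated stream, and compute streams are ordered against it with device-side stream waits instead of host-side blocking. Names and error handling are illustrative, not the allocator's actual implementation:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

struct MemoryStreamAllocator {
  cudaStream_t memory_stream_;

  MemoryStreamAllocator() {
    cudaStreamCreateWithPriority(&memory_stream_, cudaStreamNonBlocking, 0);
  }

  void* Allocate(size_t size, cudaStream_t compute_stream) {
    void* ptr = nullptr;
    cudaMallocAsync(&ptr, size, memory_stream_);
    // Order the compute stream after the allocation, entirely on-device.
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    cudaEventRecord(ev, memory_stream_);
    cudaStreamWaitEvent(compute_stream, ev, 0);
    cudaEventDestroy(ev);  // Safe: destruction is deferred until completion.
    return ptr;
  }

  void Free(void* ptr, cudaStream_t compute_stream) {
    // Order the free after the compute stream's prior work, then hand the
    // block back to CUDA on the dedicated memory stream.
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    cudaEventRecord(ev, compute_stream);
    cudaStreamWaitEvent(memory_stream_, ev, 0);
    cudaEventDestroy(ev);
    cudaFreeAsync(ptr, memory_stream_);
  }
};
```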
std::map<gpuStream_t, gpuEvent_t> event_map_;
};

// The `CUDAMallocAsyncAllocator` class extends `Allocator` and is specialized
When introducing CUDAMallocAsyncAllocator with stream-ordered semantics, why do we still need a complex CUDA-event-related mechanism similar to StreamSafeCUDAAllocator's, involving EventRecord, EventQuery, and unfreed-Allocation management? Since in Paddle the Allocator's malloc and free are dispatched to the same stream as the relevant OP kernel, and cross-stream synchronization for kernels is guaranteed by upstream code, can we transfer all management responsibilities entirely to CUDA and just make simple cudaMallocAsync/cudaFreeAsync calls in the Allocator?
A mechanism that fully offloads memory management responsibilities to CUDA is in development, using memory streams for this specific aim; it remains under construction and may see significant advancements in forthcoming pull requests (note here). At present, APIs akin to EventRecord and EventQuery, along with the management of unfreed allocations, are still required, primarily because the stream passed into the Allocator is the default stream, while RecordStream is employed to annotate the specific stream that a memory block actually operates on; this semantic is not fully compatible with a stream-ordered allocator. Our ongoing efforts focus on addressing these issues, with a commitment to refining this aspect in future updates.
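To illustrate the bookkeeping this reply describes, here is a minimal sketch, assuming simplified names rather than the allocator's actual code: because a block is handed out on the default stream but may be used on others via RecordStream, a free must snapshot outstanding work on every recorded stream with events, and the block stays on an "unfreed" list until an EventQuery-style poll sees them all complete:

```cpp
#include <cuda_runtime.h>
#include <set>
#include <vector>

struct AsyncAllocation {
  void* ptr = nullptr;
  cudaStream_t owning_stream{};          // Stream the block was allocated on.
  std::set<cudaStream_t> used_streams_;  // Streams added via RecordStream.
  std::vector<cudaEvent_t> pending_events_;

  void RecordStream(cudaStream_t s) { used_streams_.insert(s); }

  // Called when the framework frees the block: record one event per stream
  // that touched it, capturing all work enqueued so far.
  void MarkFree() {
    for (cudaStream_t s : used_streams_) {
      cudaEvent_t ev;
      cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
      cudaEventRecord(ev, s);
      pending_events_.push_back(ev);
    }
  }

  // Polled later: release only once every recorded event has completed;
  // until then the allocation remains on the allocator's unfreed list.
  bool TryRelease() {
    for (cudaEvent_t ev : pending_events_) {
      if (cudaEventQuery(ev) == cudaErrorNotReady) return false;
    }
    for (cudaEvent_t ev : pending_events_) cudaEventDestroy(ev);
    cudaFreeAsync(ptr, owning_stream);
    return true;
  }
};
```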
LGTM
PR types
New features

PR changes
Others

Description
This PR introduces Pipeline Parallelism (PP) and Virtual-Pipeline Parallelism (VP) training through the integration of CUDA Graph. The following is a detailed breakdown of the challenges encountered and the solutions we have implemented:

PP/VP + CUDA Graph
Usage: Enable CUDA Graph in PipelineLayer using the use_cudagraph=true flag.

CUDAMallocAsyncAllocator
Usage: Activate the CUDAMallocAsyncAllocator by setting FLAG_use_cuda_malloc_async_allocator=1.

CUDAMallocAsyncAllocator
The CUDAMallocAsyncAllocator replaces the StreamSafeCUDAAllocator. By leveraging the capabilities of cudaMallocAsync and cudaFreeAsync, the responsibility for stream-ordered memory management is transferred from the framework to CUDA. This transition may lead to better memory utilization and potentially improved application performance by optimizing how memory is allocated and deallocated within CUDA.

CUDAMallocAsyncAllocator + CUDAGraph