[AutoParallel] Support paddle.distributed.reshard constructing GradNode #58238
Conversation
Support paddle.distributed.reshard constructing GradNode, which is needed for pipeline parallel.
Your PR has been submitted successfully. Thank you for your contribution to the open-source project!
for (int i = 0; i < 1; ++i) {
  out_metas[i].size() == 0 ? returns[i].resize(1)
                           : returns[i].resize(out_metas[i].size());
}
Why is a for loop still needed here?
It has been removed, thx.
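For context, a minimal sketch of the simplification, assuming a single output slot (which is what the loop bound of 1 implies); `out_metas` and `returns` here are stand-in vectors, not the real eager codegen types:

```cpp
#include <vector>

int main() {
  // Stand-ins for the generated code's out_metas/returns with one output slot.
  std::vector<std::vector<int>> out_metas(1);
  std::vector<std::vector<int>> returns(1);

  // With exactly one output, the loop over [0, 1) reduces to one statement.
  out_metas[0].size() == 0 ? returns[0].resize(1)
                           : returns[0].resize(out_metas[0].size());
  return 0;
}
```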
test/auto_parallel/reshard_api.py
in_shard_specs = [None for i in range(len(self._shape))]
out_shard_specs = [None for i in range(len(self._shape))]
# out_shard_specs[self._shard] = "x"
If both sides are replicated, will it actually go through reshard?
Changed it to partial -> replicated, thx.
#include "paddle/fluid/eager/utils.h" | ||
#include "paddle/fluid/framework/op_registry.h" | ||
#include "paddle/fluid/imperative/tracer.h" | ||
#include "paddle/phi/api/all.h" |
Including all here pulls in quite a lot of symbols; it is recommended to include only the headers that are needed.
The unused header files have been removed, thx.
#include "paddle/fluid/eager/api/manual/eager_manual/nodes/nodes.h" | ||
#include "paddle/fluid/eager/api/utils/global_utils.h" | ||
#include "paddle/fluid/eager/utils.h" | ||
#include "paddle/fluid/framework/op_registry.h" |
Is this static-graph op registration header still needed?
The unused header files have been removed, thx.
<< "reshard_func"; | ||
|
||
// Backward call reshard_func function | ||
// reshard_func(grad_out, grad_input, dist_attr); |
If this code is identical in both places, it would be better to wrap it into a function first; for example, could it be placed in phi/api/include/tensor_utils.h?
This code has been wrapped into a phi::distributed::Reshard function, located in paddle/phi/core/distributed/auto_parallel/reshard_utils.h. A follow-up PR will unify the places that use the same code so they all call phi::distributed::Reshard. thx
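As a toy illustration of the refactor described here (one shared helper replacing duplicated call sites); `Tensor`, `DistAttr`, and the helper body are hypothetical stand-ins, not phi's real types or reshard logic:

```cpp
#include <cassert>

// Hypothetical stand-ins; the real phi::distributed::Reshard operates on
// distributed tensors and their dist_attr placements.
struct DistAttr { int placement = 0; };
struct Tensor   { int placement = 0; };

// One shared helper: forward and the backward GradNode both call this
// instead of duplicating the reshard logic inline.
Tensor Reshard(const Tensor& in, const DistAttr& dist_attr) {
  Tensor out = in;
  out.placement = dist_attr.placement;
  return out;
}

int main() {
  Tensor grad_out{0};
  DistAttr dist_attr{1};
  Tensor grad_input = Reshard(grad_out, dist_attr);  // backward call site
  assert(grad_input.placement == 1);
  return 0;
}
```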
@@ -20,8 +20,11 @@
#include <string>
#include <vector>

#include "paddle/phi/api/include/tensor.h"
The phi api directory has a one-way dependency on the other directories; they are not on the same layer, so core cannot include code from api. If this interface operates on paddle::Tensor, implementing it under the core directory is not recommended.
grad_node =
    std::shared_ptr<ReshardGradNode>(new ReshardGradNode(1, 1));  // NOLINT
Is it reasonable to write it this way? Why the explicit new here?
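For reference, the alternative the question hints at, sketched with a stand-in `ReshardGradNode` (assuming a public constructor; a private constructor would be one legitimate reason to keep the explicit `new`):

```cpp
#include <memory>

// Stand-in for the real node, which takes (bwd_in_slot_num, bwd_out_slot_num).
struct ReshardGradNode {
  ReshardGradNode(int bwd_in, int bwd_out) : in_num(bwd_in), out_num(bwd_out) {}
  int in_num;
  int out_num;
};

int main() {
  // std::make_shared allocates the object and the control block together
  // and avoids a raw `new` at the call site.
  auto grad_node = std::make_shared<ReshardGradNode>(1, 1);
  return grad_node->in_num == 1 ? 0 : 1;
}
```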
Force-pushed from 6d8e840 to 1d7e5ad.
[AutoParallel] Support paddle.distributed.reshard constructing GradNode (PaddlePaddle#58238)

* [AutoParallel] Support paddle.distributed.reshard construct GradNode, which is needed for pipeline parallel.
* Fix problem of CI, and fix pp testcase as review comments advising.
* Fix including files problem.
* Polish paddle.distributed.reshard implementation according to review comments.
* Fix some problems.
* Polish code.
* Fix problem of failed testcase.
* Move reshard function to tensor_utils.h, as files in phi/core is not allowed to include files in phi/api.
* Add forgetting file.
* Fix some compilation problem.
* Remove useless PADDLE_WITH_DISTRIBUTE conditional compilation.
* Remove useless PADDLE_WITH_DISTRIBUTE conditional compilation.
* Fix problem of WITH_PYTHON=OFF compilation option.
* Fix bug of conditional compilation.
PR types
Others
PR changes
APIs
Description
Pcard-73145
Re-implement the paddle.distributed.reshard API. It now constructs a GradNode automatically when the function is called. The implementation will be polished continuously as the pipeline-parallel backward pass is developed.
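To illustrate the mechanism described above (a forward call that also wires up the node used by backward), a toy sketch follows; all types are illustrative stand-ins, not Paddle's eager classes:

```cpp
#include <functional>
#include <iostream>
#include <memory>

// Stand-in autograd node: records how to map the output gradient back to
// the input gradient (for reshard, backward reshards the gradient back).
struct GradNode {
  std::function<double(double)> apply;
};

// Stand-in tensor carrying a value and an optional grad node.
struct Tensor {
  double value = 0.0;
  std::shared_ptr<GradNode> grad_node;
};

// Forward API in the style the PR describes: it computes the output and
// constructs the GradNode for the backward pass in the same call.
Tensor ForwardWithGradNode(const Tensor& in) {
  Tensor out;
  out.value = in.value * 2.0;
  out.grad_node = std::make_shared<GradNode>();
  out.grad_node->apply = [](double grad_out) { return grad_out * 2.0; };
  return out;
}

int main() {
  Tensor x{3.0, nullptr};
  Tensor y = ForwardWithGradNode(x);
  std::cout << y.grad_node->apply(1.0) << "\n";  // prints 2
  return 0;
}
```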