[AutoParallel] Generate spmd and reshard into phi grad api #57119
Conversation
Your PR has been submitted successfully. Thank you for your contribution to the open-source project!
… ap/generate_spmd_and_reshard_grad
TensorDistAttr dst_dist_attr = CopyTensorDistAttrForOutput(dist_attr);
std::vector<int64_t> dims_mapping(dist_attr.dims_mapping().size(), -1);
dst_dist_attr.set_dims_mapping(dims_mapping);
dst_dist_attr.clean_partial_status();
return dst_dist_attr;
Actually, this can be constructed directly: the TensorDistAttr constructor already creates a replicated state, so only the process_mesh needs to be set afterwards. That avoids the logic here of first copying a lot of unneeded information.
If the intent is to keep some information from the input, note that CopyTensorDistAttrForOutput already performs clean_partial, so there is no need to do it again here.
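A minimal sketch of this suggestion, assuming (as the comment notes) that a freshly constructed TensorDistAttr is already in replicated state; BuildReplicatedDistAttr is a hypothetical helper name used only for illustration:

TensorDistAttr BuildReplicatedDistAttr(const TensorDistAttr& src) {
  // Construct directly instead of copying and resetting: the new attr is
  // already replicated, so only the process_mesh is carried over.
  TensorDistAttr dst_dist_attr;
  dst_dist_attr.set_process_mesh(src.process_mesh());
  return dst_dist_attr;
}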
done, thx, removed the redundant clean_partial_status call
paddle/phi/api/lib/data_transform.cc (Outdated)
if (src_tensor->dist_attr() != dist_tensor->dist_attr()) {
  VLOG(6) << "BwdAPI KernelOut to ApiOut - "
          << ReshardDebugInfo(*src_tensor, dist_tensor->dist_attr());
  auto* func = phi::distributed::ChooseProperReshardFunction(
      *src_tensor, dist_tensor->dist_attr());
  func->Eval(dev_ctx, *src_tensor, dist_tensor->dist_attr(), dist_tensor);
} else {
  // shallow copy dense tensor
  *dist_tensor->unsafe_mutable_value() = src_tensor->value();
}
Add a TODO for Reshard: later this copy should become one of the reshard rules, so the branch check here would no longer be needed.
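A hedged sketch of what such a rule could look like (SameDistAttrReshardFunction is a hypothetical name; the Eval signature follows the call shown in the snippet above, and the exact base-class interface may differ):

class SameDistAttrReshardFunction : public ReshardFunction {
 public:
  // Matches the case where source and target dist_attr are identical.
  bool IsSuitable(const DistTensor& in,
                  const TensorDistAttr& out_dist_attr) override {
    return in.dist_attr() == out_dist_attr;
  }
  // Shallow copy of the dense tensor, replacing the else branch above.
  void Eval(DeviceContext* dev_ctx,
            const DistTensor& in,
            const TensorDistAttr& out_dist_attr,
            DistTensor* out) override {
    *out->unsafe_mutable_value() = in.value();
  }
};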
done, thx
if (out_tensor->dist_attr().is_partial()) {
  auto dist_attr = out_tensor->dist_attr();
  dist_attr.clean_partial_status();
  VLOG(6) << "FwdAPI Output P2R - "
          << ReshardDebugInfo(*out_tensor, dist_attr);
  auto* func =
      phi::distributed::ChooseProperReshardFunction(*out_tensor, dist_attr);
  func->Eval(dev_ctx, *out_tensor, dist_attr, out_tensor);
The logic here seems to have a gap: if out_tensor's dist_attr is [S(0), P], then after this conversion it can only become [S(0), R]. Is that the intended state, or should it be [R, R]?
It is the former: only the propagation of P is cut off here, all other states are preserved. The function name did not express this accurately; it has been renamed.
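A concrete illustration of this behavior, with example values chosen only for this comment:

// Suppose out_tensor is distributed as [S(0), P] on a 2-D mesh, i.e.
// dims_mapping = {0, -1} with partial status on mesh dim 1.
auto dist_attr = out_tensor->dist_attr();
dist_attr.clean_partial_status();  // clears only the partial status
// dims_mapping is still {0, -1}, so the result is [S(0), R], not [R, R].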
LGTM
LGTM
kernel_param = self.kernel['param']
if kernel_param is None:
    kernel_param = input_names + attr_names

infer_meta_params = (
    self.infer_meta['param']
    if self.infer_meta['param'] is not None
    else input_names + attr_names
)
input_decl_code = ""
self.input_args_code = ""
for param in infer_meta_params:
input_args_code = ""
for param in kernel_param:
spmd_rule is configured under infer_meta, but using kernel_param as its input parameters feels behaviorally inconsistent. Could the op's args be used directly here?
Not necessarily: only the values that participate in the subsequent dense kernel computation need sharding handling.
TODO from offline discussion: this should eventually stay consistent with the InferMeta parameter list; where the parameters differ, the InferMeta parameters should be adjusted to make them consistent. Backward ops in particular have this problem.
LGTM
…dle#57119)

* add grad spmd and reshard
* add debug info and backward gen code
* re impl matmul grad infer meta
* remove matmul infershape in runtime
* add eager gen code
* revert matmul change
* refactor matmul grad spmd rules
* polish details
* revert p2r change
* fix set dist attr error
* fix conflict
* polish details
* polish details by comments
* remove semi prefix
* add more tests for coverage
PR types
New features
PR changes
Others
Description
Pcard-73145
[AutoParallel] Generate spmd and reshard into phi grad api
This PR generates the sharding inference (spmd) and conversion (reshard) logic in the PHI API and the dynamic-graph backward execution flow. Key features:
Before the dynamic-graph backward computation runs, the dist attr of the forward inputs must be propagated to the backward outputs; otherwise the backward outputs would not know which sharding state to take. The corresponding attribute propagation was therefore added, and the DistTensor outputs of the backward phi api are created ahead of time (see the sketch after these items).
// 9. Reshard Partial Output to Replicated (Temporary) - Partial cannot currently be propagated between operators, so all Partial forward outputs are resharded to Replicated at output time.
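A rough conceptual sketch of the attribute propagation described in the first item above (not the actual generated code; CreateBackwardDistOutput and fwd_input are hypothetical names used only for illustration):

// Hedged sketch: before the backward kernel runs, the backward output
// DistTensor is created up front carrying the dist_attr of the matching
// forward input, so the desired sharding state of the gradient is known.
// CreateBackwardDistOutput / fwd_input are illustrative names only.
auto* grad_x = CreateBackwardDistOutput(fwd_input.dist_attr());
// ... run the backward phi api; its kernel output is then resharded into
// grad_x's dist_attr, as in the data_transform.cc snippet above.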
The implementation in this PR is not yet complete, because both the forward and the backward infermeta implementations of the matmul operator have issues; their correct infermeta is currently performed at kernel execution time.
The matmul issue will be handled in a follow-up PR. In this PR, cases that cannot be handled are converted to replicated for computation, so validation of the overall architecture is not blocked.