[Dy2St] Clean unused inputs and outputs for backward #66278

SigureMo · 2024-07-19T09:05:48Z

PR Category

Execute Infrastructure

PR Types

Performance

Description

修复 SOT+PIR 下 OOM 的问题，主要是清理掉反向用不到的前向输入和前向输出，使之前向结束后不再 hold 在 scope 中，详细分析见 comment

PCard-66972

paddle-bot · 2024-07-19T09:05:52Z

你的PR提交成功，感谢你对开源项目的贡献!
请关注后续CI自动化测试结果，详情请参考Paddle-CI手册。
Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

SigureMo · 2024-07-19T09:12:01Z

现象

CSWinTransformer 在 SOT+PIR 下，V100 16G 机器上会 OOM，其他几种模式包括 SOT+PT、AST+PT、AST+PIR、动态图峰值显存全都是 13G 左右，而 SOT+PIR 则会 OOM

SOT+PIR 第一个 step 的前向都没跑完就 OOM 了

从代码上来看，首先 SOT 代码在 PIR 和 PT 基本一致，基本不可能是 SOT 多 hold 什么 Tensor，但 AST 那层也就是 partial_program -> run_program OP 这边代码就不一致了，大概率问题出在这里

分析

组网分析

从可比性来讲，SOT+PT 和 SOT+PIR 下差异是最小的，只要让 SOT+PIR 对齐 SOT+PT 的策略就能保证显存对齐，因此如下分析都针对于两者
首先两者都开启 SOT_LOG_LEVEL=2 对比 SOT 的组网 log，对比差异发现 OOM 之前组网 SIR 完全一致，因此组网层面是相同的，那么就可以进一步分析各个子图内的问题了

子图分析

在那之前，我们先看看显存要如何分析

I0718 13:26:58.767019 22805 auto_growth_best_fit_allocator.cc:131] Alloc 3538944 bytes, ptr = 0x7f40fed80000
...
I0718 13:26:58.772142 22805 auto_growth_best_fit_allocator.cc:139] Free 3538944 bytes, ptr = 0x7f40fed80000

我们的日志里存在如上的内存分配 log，因此可以通过这个来分析子图的显存问题，首先收集 log

# 手动将上面代码 VLOG(10) 改成了 VLOG(7)，因为 GLOG_v=10 PT 会挂，暂不分析
# 当然也可以用 GLOG_vmodule=auto_growth_best_fit_allocator=10 来只收集显存信息，前期只分析显存可以这样，后期因为需要分析周围 log，所以都需要 dump
# 设置 FLAGS_new_executor_sequential_run 确保顺序是可比的
GLOG_v=7 FLAGS_new_executor_sequential_run=true SOT_LOG_LEVEL=0 ENABLE_FALL_BACK=True MIN_GRAPH_SIZE=0 FLAGS_enable_pir_api=1 python tools/train.py -c ppcls/configs/ImageNet/CSWinTransformer/CSWinTransformer_base_384.yaml \
 -o Global.epochs=1 \
 -o Global.save_interval=1 \
 -o Global.eval_interval=1 \
 -o Global.seed=1234 \
 -o DataLoader.Train.dataset.image_root=/workspace/PaddleClas/dataset/ILSVRC2012/ \
 -o DataLoader.Train.dataset.cls_label_path=/workspace/PaddleClas/dataset/ILSVRC2012/train_list.txt \
 -o DataLoader.Train.sampler.batch_size=8 \
 -o DataLoader.Eval.dataset.image_root=/workspace/PaddleClas/dataset/ILSVRC2012/ \
 -o DataLoader.Eval.dataset.cls_label_path=/workspace/PaddleClas/dataset/ILSVRC2012/val_list.txt \
 -o DataLoader.Eval.sampler.batch_size=8 \
 -o DataLoader.Train.loader.num_workers=0 \
 -o DataLoader.Train.sampler.shuffle=False \
 -o Global.output_dir=output/ppcls/configs/ImageNet/CSWinTransformer/CSWinTransformer_base_384 \
 -o Global.to_static=True >! oom-pir-alloc.log 2>&1
 GLOG_v=7 FLAGS_new_executor_sequential_run=true SOT_LOG_LEVEL=0 ENABLE_FALL_BACK=True MIN_GRAPH_SIZE=0 FLAGS_enable_pir_api=0 python tools/train.py -c ppcls/configs/ImageNet/CSWinTransformer/CSWinTransformer_base_384.yaml \
 -o Global.epochs=1 \
 -o Global.save_interval=1 \
 -o Global.eval_interval=1 \
 -o Global.seed=1234 \
 -o DataLoader.Train.dataset.image_root=/workspace/PaddleClas/dataset/ILSVRC2012/ \
 -o DataLoader.Train.dataset.cls_label_path=/workspace/PaddleClas/dataset/ILSVRC2012/train_list.txt \
 -o DataLoader.Train.sampler.batch_size=8 \
 -o DataLoader.Eval.dataset.image_root=/workspace/PaddleClas/dataset/ILSVRC2012/ \
 -o DataLoader.Eval.dataset.cls_label_path=/workspace/PaddleClas/dataset/ILSVRC2012/val_list.txt \
 -o DataLoader.Eval.sampler.batch_size=8 \
 -o DataLoader.Train.loader.num_workers=0 \
 -o DataLoader.Train.sampler.shuffle=False \
 -o Global.output_dir=output/ppcls/configs/ImageNet/CSWinTransformer/CSWinTransformer_base_384 \
 -o Global.to_static=True >! oom-pt-alloc.log 2>&1

dump 日志，编写简单脚本快速分析显存问题：

from __future__ import annotations
from dataclasses import dataclass
from typing import Callable
import sys
import re
LOG_PATH_TEMPLATE = "oom-{}-alloc.log"
# LOG_PATH = "oom-pt-alloc.log"
# LOG_PATH = "oom-pir-alloc.log"
class MemoryInfoItem:
    def __init__(self, ptr: str, size: int):
        self.ptr = ptr
        self.size = size
def format_bytes(size: int) -> str:
    units = ["B", "KB", "MB", "GB", "TB"]
    unit_idx = 0
    sign = "" if size >= 0 else "-"
    size = abs(size)
    while size >= 1024 and unit_idx < len(units):
        size /= 1024
        unit_idx += 1
    return f"{sign}{size:.2f} {units[unit_idx]}"
class AllocMemoryInfoItem(MemoryInfoItem):
    LOG_MATCH_REGEX = re.compile(r".+Alloc (?P<size>\d+) bytes, ptr = (?P<ptr>0x\w+)")
    def __init__(self, ptr: str, size: int):
        super().__init__(ptr, size)
    def __repr__(self):
        return f"AllocMemoryInfoItem(ptr={self.ptr}, size={self.size})"
class FreeMemoryInfoItem(MemoryInfoItem):
    LOG_MATCH_REGEX = re.compile(r".+Free (?P<size>\d+) bytes, ptr = (?P<ptr>0x\w+)")
    def __init__(self, ptr: str, size: int):
        super().__init__(ptr, size)
    def __repr__(self):
        return f"FreeMemoryInfoItem(ptr={self.ptr}, size={self.size})"
def extract_memory_info(logs: list[str], start_fn: Callable[[str], bool], end_fn:  Callable[[str], bool]) -> list[MemoryInfoItem]:
    memory_info = []
    started = False
    for line in logs:
        if not started and not start_fn(line):
            continue
        started = True
        if end_fn(line):
            break
        if match_obj := AllocMemoryInfoItem.LOG_MATCH_REGEX.match(line):
            memory_info.append(AllocMemoryInfoItem(match_obj.group("ptr"), int(match_obj.group("size"))))
        elif match_obj := FreeMemoryInfoItem.LOG_MATCH_REGEX.match(line):
            memory_info.append(FreeMemoryInfoItem(match_obj.group("ptr"), int(match_obj.group("size"))))
    return memory_info
def read_log(log_path: str) -> list[str]:
    with open(log_path, "r") as f:
        lines = f.readlines()
    return lines
class MemoryAnalyzer:
    def __call__(self, memory_info: list[MemoryInfoItem]):
        ...
class RemainingMemoryAnalyzer(MemoryAnalyzer):
    def __init__(self):
        self.remaining_memory = 0
        self.max_memory = 0
    def __call__(self, memory_info: list[MemoryInfoItem]):
        remaining_memory = 0
        for item in memory_info:
            if isinstance(item, AllocMemoryInfoItem):
                remaining_memory += item.size
            elif isinstance(item, FreeMemoryInfoItem):
                remaining_memory -= item.size
            if remaining_memory > self.max_memory:
                self.max_memory = remaining_memory
        self.remaining_memory = remaining_memory
    
    def summary(self):
        print(f"Remaining memory: {format_bytes(self.remaining_memory)}")
        print(f"Max memory: {format_bytes(self.max_memory)}")
class AllocFreeStatAnalyzer(MemoryAnalyzer):
    def __init__(self):
        self.alloc_size = 0
        self.free_size = 0
        self.alloc_count = 0
        self.free_count = 0
    def __call__(self, memory_info: list[MemoryInfoItem]):
        alloc_size = 0
        free_size = 0
        alloc_count = 0
        free_count = 0
        for item in memory_info:
            if isinstance(item, AllocMemoryInfoItem):
                alloc_size += item.size
                alloc_count += 1
            elif isinstance(item, FreeMemoryInfoItem):
                free_size += item.size
                free_count += 1
        self.alloc_size = alloc_size
        self.free_size = free_size
        self.alloc_count = alloc_count
        self.free_count = free_count
    def summary(self):
        print(f"Allocated memory: {format_bytes(self.alloc_size)}")
        print(f"Allocated count: {self.alloc_count}")
        print(f"Freed memory: {format_bytes(self.free_size)}")
        print(f"Freed count: {self.free_count}")
def analyse_memory_info(memory_info: list[MemoryInfoItem]):
    # remaining_memory
    remaining_memory_analyzer = RemainingMemoryAnalyzer()
    remaining_memory_analyzer(memory_info)
    remaining_memory_analyzer.summary()
    
    # alloc free stat
    alloc_free_stat_analyzer = AllocFreeStatAnalyzer()
    alloc_free_stat_analyzer(memory_info)
    alloc_free_stat_analyzer.summary()
logs = read_log(LOG_PATH_TEMPLATE.format(sys.argv[1]))
memory_info = extract_memory_info(
    logs,
    lambda line: "START SOT_CALL_0" in line,# or "EPOCH START" in line,
    lambda line: "EPOCH END" in line or "Traceback" in line or "START SOT_CALL_429" in line
)
# print(memory_info)
analyse_memory_info(memory_info)

当然，为了能够让 log 对齐，我们在 log 中插入一些「锚点」，利用 print(anchor_name, flush=True)，就比如 START SOT_CALL_10、START SOT_CALL_429 这种，以便裁剪 log、局部显存分析等

通过该脚本，我们可以很容易分析得到各个子图的 log 分析结果

nyakku@localhost /workspace/PaddleClas develop* ⇣
paddle-py310 ❯ python memory-analyzer.py pir
Remaining memory: 13.08 GB
Max memory: 13.11 GB
Allocated memory: 21.67 GB
Allocated count: 3791
Freed memory: 8.59 GB
Freed count: 1942
nyakku@localhost /workspace/PaddleClas develop* ⇣
paddle-py310 ❯ python memory-analyzer.py pt 
Remaining memory: 9.37 GB
Max memory: 9.41 GB
Allocated memory: 21.67 GB
Allocated count: 2939
Freed memory: 12.30 GB
Freed count: 1841

比如这里分析的是 START SOT_CALL_1（第一个子图）到 START SOT_CALL_429（PIR 下 OOM 前最后一个子图）全部子图显存申请和释放 log

可以看到几点信息：

PIR 峰值显存明显高于 PT
PIR 申请显存总量等于 PT（21.67 G）
PIR 释放显存明显低于 PT

由此可以得知，问题主要是 SOT 前向时候显存释放问题，进一步缩小范围查看具体是哪个子图的问题

结果发现几乎所有子图都有问题……

那就分析一个较小子图，比如 START SOT_CALL_428 到 START SOT_CALL_429 之间的日志

nyakku@localhost /workspace/PaddleClas develop* ⇣
paddle-py310 ❯ python memory-analyzer.py pir
Remaining memory: 6.75 MB
Max memory: 10.12 MB
Allocated memory: 10.12 MB
Allocated count: 3
Freed memory: 3.38 MB
Freed count: 1
nyakku@localhost /workspace/PaddleClas develop* ⇣
paddle-py310 ❯ python memory-analyzer.py pt 
Remaining memory: 3.38 MB
Max memory: 10.12 MB
Allocated memory: 10.12 MB
Allocated count: 3
Freed memory: 6.75 MB
Freed count: 2

明显 PT 多了一次释放，编写脚本裁剪 START SOT_CALL_428 到 START SOT_CALL_429 之间的日志进行分析

组网很简单，比如 PIR 下前向组网如下：

Forward Program:
{
    (%0) = "pd_op.data" () {dtype:(pd_op.DataType)float32,name:"args_0",place:(pd_op.Place)Place(undefined:0),shape:(pd_op.IntArray)[8,192,24,24],stop_gradient:[false]} : () -> builtin.tensor<8x192x24x24xf32>
    (%1) = "pd_op.full_int_array" () {dtype:(pd_op.DataType)int64,place:(pd_op.Place)Place(cpu),stop_gradient:[true],value:[(Int64)8,(Int64)192,(Int64)2,(Int64)12,(Int64)1,(Int64)24]} : () -> builtin.tensor<6xi64>
    (%2, %3) = "pd_op.reshape" (%0, %1) {stop_gradient:[false,false]} : (builtin.tensor<8x192x24x24xf32>, builtin.tensor<6xi64>) -> builtin.tensor<8x192x2x12x1x24xf32>, builtin.tensor<0x8x192x24x24xi64>
    (%4) = "pd_op.transpose" (%2) {perm:[(Int32)0,(Int32)2,(Int32)4,(Int32)3,(Int32)5,(Int32)1],stop_gradient:[false]} : (builtin.tensor<8x192x2x12x1x24xf32>) -> builtin.tensor<8x2x1x12x24x192xf32>
    (%5) = "pd_op.full_int_array" () {dtype:(pd_op.DataType)int64,place:(pd_op.Place)Place(cpu),stop_gradient:[true],value:[(Int64)-1,(Int64)288,(Int64)192]} : () -> builtin.tensor<3xi64>
    (%6, %7) = "pd_op.reshape" (%4, %5) {stop_gradient:[false,false]} : (builtin.tensor<8x2x1x12x24x192xf32>, builtin.tensor<3xi64>) -> builtin.tensor<16x288x192xf32>, builtin.tensor<0x8x2x1x12x24x192xi64>
    () = "builtin.shadow_output" (%3) {output_name:"output_1"} : (builtin.tensor<0x8x192x24x24xi64>) -> 
    () = "builtin.shadow_output" (%7) {output_name:"output_2"} : (builtin.tensor<0x8x2x1x12x24x192xi64>) -> 
    () = "builtin.shadow_output" (%6) {output_name:"output_0"} : (builtin.tensor<16x288x192xf32>) -> 
}

对比两者日志，可以很容易发现 PIR 缺少的 Free

# PIR
I0719 05:11:12.650810 10519 grad_node_info.h:202] Destruct GradNodeBase
I0719 05:11:12.650815 10519 run_program_op_func.h:444] run_program_ad_func End
I0719 05:11:12.650820 10519 eager_legacy_custom_python_api.h:100] run_program_ad_func finished
I0719 05:11:12.650823 10519 eager_legacy_custom_python_api.h:102] After PyEval_RestoreThread
I0719 05:11:12.650827 10519 eager_legacy_custom_python_api.h:104] set tstate to nullptr
END SOT_CALL_428
# PT
I0719 05:14:21.497383 13613 grad_node_info.h:202] Destruct GradNodeBase
I0719 05:14:21.497388 13613 run_program_op_func.h:304] Construct grad node End
I0719 05:14:21.497392 13613 run_program_op_func.h:306] run_program_ad_func End
I0719 05:14:21.497399 13613 auto_growth_best_fit_allocator.cc:139] Free 3538944 bytes, ptr = 0x7ff070a00000
I0719 05:14:21.497406 13613 eager_legacy_custom_python_api.h:48] run_program_ad_func finished
I0719 05:14:21.497411 13613 eager_legacy_custom_python_api.h:50] After PyEval_RestoreThread
I0719 05:14:21.497414 13613 eager_legacy_custom_python_api.h:52] set tstate to nullptr
END SOT_CALL_428

缺少的 Free 出现在出 run_program_ad_func 作用域时，也就是说有一个 Tensor 在出作用域时引用计数减为 0 而释放了

通过排查发现这个 Tensor 是输入，但不完全是用户代码里 hold 的那个输入，因为用户的输入是 stride 的，因此传入动转静后会先转连续，申请一块新的内存，所以这个输入在用户代码侧是不会 hold 住的，因此出了作用域就释放掉了

那 PIR 为啥没释放掉呢？通过查看这个 holder 的引用计数（std::dynamic_pointer_cast<phi::DenseTensor>(x.impl())->Holder().use_count()），可以发现两者 use_count 发生 diff 的位置是在跑 RunProgramAPI后……

所以问题还是出在 RunProgramAPI 时部分 Tensor 被 hold 在了 scope，可以通过打印 SetSkipGcVars 的参数发现 PT 下并没有把输入 skip gc，而 PIR 设置了

这是因为 PIR 的算法是 skip_names = mids - no_needs + inputs + outputs，这就导致了 inputs 永远在里面，因此先尝试了改成 skip_names = mids + inputs - no_needs + outputs，这样 inputs 就有被 no_need_buffers 过滤的可能了，但是，事实上 no_need_buffers 两者都是空，因此没有对齐的不是这里

通过查看 ParseSafeEagerDeletionSkipVarsSet 的逻辑发现，PT（老 IR）做的是 skip_names = backward_need_buffers + outputs，其中 backward_need_buffers = backward_need_buffers_op_inputs - backward_op_outputs，后面只是过滤掉反向自己计算的结果，确保只保留反向的输入而已，所以其实就是反向的全部 need_buffers 的输入

那么 PIR 下 mids + inputs - no_needs 和 PT backward_need_buffers 的区别是什么呢？进一步精简就是 mids + inputs 和 backward_inputs 的区别是什么呢？这里 mids 和 inputs 都是从前向视角来看反向所有可能用到的变量，但是反向不一定用到这里所有的变量，因此就会导致 mids + inputs > backward_inputs，多 skip 一些变量的 GC，因此只需要将 mids + inputs 改成反向的输入即可，这可以通过 backward_blocks.kwargs().keys() 来获得

修改后，不再 OOM 了～通过脚本分析，显存完全对齐～

nyakku@localhost /workspace/PaddleClas develop* ⇣
paddle-py310 ❯ python memory-analyzer.py pir
Remaining memory: 11.88 GB
Max memory: 11.92 GB
Allocated memory: 21.67 GB
Allocated count: 3791
Freed memory: 9.79 GB
Freed count: 2201
nyakku@localhost /workspace/PaddleClas develop* ⇣
paddle-py310 ❯ python memory-analyzer.py pt 
Remaining memory: 11.88 GB
Max memory: 11.92 GB
Allocated memory: 21.67 GB
Allocated count: 2939
Freed memory: 9.79 GB
Freed count: 1349

但是峰值显存仍然比 PT 高一些，16G 擦边过，多一点就跑不了了，不过因为之前对比源码已经发现了 PIR 少了一段清理 scope 里反向用不到的前向输出的逻辑，PT 下注释掉该逻辑可以复现 16G 的问题（上面的日志就是注释掉之后的），因此可以确定是该段逻辑导致的，补齐该段逻辑后显存一切正常～也是 13G～

2742195759

LGTM

SigureMo

另外记一个 TODO，对于 name 获取的统一，我们现在在

pir.cc（pybind 层前反向拆分等逻辑）
pir_partial_program.py
run_program_op_node.h

各有一套 name 获取的逻辑，但是它们是能够统一的，比如 RunProgramOP 这里现在其实是遍历整个 Program，这其实是非常耗时的，事实上只需要从上下游 OP 上获取即可，对于未来极致优化性能来说是一个潜在的优化点，这件事情之后会由 @gouzil 来推进

SigureMo · 2024-07-19T15:41:08Z

paddle/fluid/eager/to_static/run_program_op_node.h

-    // *backward_program);
-
+    // Step 3. get all eager gc vars (skip_names = backward_inputs -
+    // no_need_buffers)


Suggested change

// no_need_buffers)

// no_need_buffers + outputs)

这里之后 PR 改一下

)

[Dy2St] Clean unused inputs and outputs for backward

96ce7e8

2742195759 approved these changes Jul 19, 2024

View reviewed changes

SigureMo commented Jul 19, 2024

View reviewed changes

SigureMo closed this Jul 19, 2024

SigureMo reopened this Jul 19, 2024

SigureMo merged commit f624a93 into PaddlePaddle:develop Jul 19, 2024
30 of 31 checks passed

SigureMo deleted the dy2st/clean-unused-input-and-output-for-backward branch July 19, 2024 16:13

DrRyanHuang restored the dy2st/clean-unused-input-and-output-for-backward branch July 22, 2024 02:04

DrRyanHuang deleted the dy2st/clean-unused-input-and-output-for-backward branch July 22, 2024 02:04

co63oc pushed a commit to co63oc/Paddle that referenced this pull request Jul 22, 2024

[Dy2St] Clean unused inputs and outputs for backward (PaddlePaddle#66278

3cbb074

)

lixcli pushed a commit to lixcli/Paddle that referenced this pull request Jul 22, 2024

[Dy2St] Clean unused inputs and outputs for backward (PaddlePaddle#66278

a3d529d

)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Dy2St] Clean unused inputs and outputs for backward #66278

[Dy2St] Clean unused inputs and outputs for backward #66278

Uh oh!

SigureMo commented Jul 19, 2024

Uh oh!

paddle-bot bot commented Jul 19, 2024

Uh oh!

SigureMo commented Jul 19, 2024 •

edited

Loading

Uh oh!

2742195759 left a comment

Uh oh!

SigureMo left a comment

Uh oh!

SigureMo Jul 19, 2024

Uh oh!

Uh oh!

Uh oh!

[Dy2St] Clean unused inputs and outputs for backward #66278

[Dy2St] Clean unused inputs and outputs for backward #66278

Uh oh!

Conversation

SigureMo commented Jul 19, 2024

PR Category

PR Types

Description

Uh oh!

paddle-bot bot commented Jul 19, 2024

Uh oh!

SigureMo commented Jul 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

现象

分析

组网分析

子图分析

Uh oh!

2742195759 left a comment

Choose a reason for hiding this comment

Uh oh!

SigureMo left a comment

Choose a reason for hiding this comment

Uh oh!

SigureMo Jul 19, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

SigureMo commented Jul 19, 2024 •

edited

Loading