[AutoTuner] Add auto tuner to obtain optimal configuration #54460

Caozhou1995 · 2023-06-08T06:59:25Z

PR types

New features

PR changes

Others

Description

Pcard-72023

Finding the optimal configuration for distributed training or inference of a large model usually means designing several sets of experiments based on experience (network, parameter size, GPU memory or FLOPS, etc.) and comparing their results. This process relies heavily on human experience, and the configuration it yields may not be the global optimum. Whenever any condition changes, the whole process has to be repeated, which makes large models hard to use.

To address these issues, we have implemented a profiling-based AutoTuner. Its main modules are:

A clear JSON configuration, so users can run AutoTuner directly without writing extra code.
A launch module that runs multiple tasks one by one and schedules and monitors them automatically.
A search module and a pruning module that support multiple search algorithms and pruning strategies.
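As a rough illustration of how a search module and a pruning module can cooperate, here is a minimal grid-search sketch over the eight tuned dimensions. It is not the PR's implementation: the candidate values, the key names, the granularity labels, and the pruning rules are assumptions, including the assumption that dp and sharding both split the global batch.

```python
import itertools
from typing import Dict, Iterator, List


def divisors(n: int) -> List[int]:
    """All positive divisors of n, in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def grid_search(num_devices: int, global_batch_size: int) -> Iterator[Dict]:
    """Yield candidate configs over 8 dimensions, skipping pruned ones.

    Hypothetical sketch: candidate values and pruning rules are assumptions.
    """
    degrees = divisors(num_devices)
    space = itertools.product(
        degrees,                             # dp_degree
        degrees,                             # mp_degree
        degrees,                             # pp_degree
        degrees,                             # sharding_degree
        [1, 2, 3],                           # sharding_stage
        divisors(global_batch_size),         # micro_batch_size
        [False, True],                       # use_recompute
        ["full", "full_attn", "core_attn"],  # recompute_granularity (assumed labels)
    )
    for dp, mp, pp, sd, stage, mbs, rc, gran in space:
        # Prune: the parallel degrees must exactly cover all devices.
        if dp * mp * pp * sd != num_devices:
            continue
        # Prune: each replica's batch (assumed split over dp * sharding)
        # must divide evenly into micro-batches.
        local_bs, rem = divmod(global_batch_size, dp * sd)
        if rem or local_bs % mbs:
            continue
        # Prune duplicates: the stage is meaningless without sharding ...
        if sd == 1 and stage != 1:
            continue
        # ... and the granularity is meaningless without recompute.
        if not rc and gran != "full":
            continue
        yield {
            "dp_degree": dp, "mp_degree": mp, "pp_degree": pp,
            "sharding_degree": sd, "sharding_stage": stage,
            "micro_batch_size": mbs, "use_recompute": rc,
            "recompute_granularity": gran,
        }
```

Pruning inside the generator keeps the full Cartesian product lazy, so invalid combinations are discarded before any task is launched; other search algorithms could reuse the same pruning checks.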

Grid search is currently built in for 8 dimensions: dp degree, mp degree, pp degree, mbs (micro-batch size), sharding degree, sharding stage, recompute, and recompute granularity. The example JSON is as follows:

The usage is as follows:
python -m paddle.distributed.launch --devices "0,1,2,3,4,5,6,7" --auto_tuner_json=test.json your_train.py your_args
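Conceptually, the launcher then runs one task per searched config, one at a time, and keeps the best result. The sketch below is a hypothetical, framework-free version of that loop: `tune`, `run_task`, and the throughput-style metric are assumptions, and a real run would launch `your_train.py` with the generated args instead of calling a Python function.

```python
import json
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical driver loop; these names are not Paddle APIs.
# `candidates` plays the role of the search + pruning modules and
# `run_task` plays the role of launching and monitoring one job.


def tune(
    candidates: Iterable[Dict],
    run_task: Callable[[Dict], Optional[float]],
) -> Tuple[Optional[Dict], List[Tuple[Dict, Optional[float]]]]:
    """Run each candidate in turn; return the best config and the full history.

    `run_task` returns a throughput-like metric (higher is better), or None
    when the task fails, for example because of OOM or a timeout.
    """
    history = []
    best_cfg, best_metric = None, float("-inf")
    for cfg in candidates:
        metric = run_task(cfg)  # tasks run one by one, never concurrently
        history.append((cfg, metric))
        if metric is not None and metric > best_metric:
            best_cfg, best_metric = cfg, metric
    return best_cfg, history


if __name__ == "__main__":
    # Illustrative JSON only; this is not the real AutoTuner schema.
    tuner_cfg = json.loads('{"dp_degree": [8, 4], "mp_degree": [1, 2]}')
    candidates = [
        {"dp_degree": dp, "mp_degree": mp}
        for dp in tuner_cfg["dp_degree"]
        for mp in tuner_cfg["mp_degree"]
        if dp * mp == 8
    ]
    # Fake metric standing in for measured throughput.
    best, _ = tune(candidates, lambda cfg: float(cfg["dp_degree"]))
    print(best)
```

Recording failed tasks as `None` in the history, rather than dropping them, lets a later pruning pass avoid configs similar to ones that already ran out of memory.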

NOTE: Since AutoTuner is non-invasive, users need to expose the corresponding args in their script so that the configurations generated by AutoTuner can actually be applied.
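Because the tuner only passes command-line args, both sides must agree on flag names. The sketch below shows a hypothetical round trip: `config_to_args` (tuner side) overrides or appends `--key value` pairs, and `parse_args` (user side) exposes them with `argparse`. The flag names are assumptions, and `config_to_args` is only in the spirit of the PR's `gen_new_args`, not its actual signature.

```python
import argparse
from typing import Dict, List, Optional

# Hypothetical tuner <-> user-script round trip; flag names are assumptions.


def config_to_args(cfg: Dict, user_args: List[str]) -> List[str]:
    """Tuner side: override or append `--key value` pairs for one config.

    Assumes every flag in `user_args` is immediately followed by its value.
    """
    new_args = list(user_args)
    for key, value in cfg.items():
        flag = f"--{key}"
        if flag in new_args:
            new_args[new_args.index(flag) + 1] = str(value)  # override user value
        else:
            new_args += [flag, str(value)]  # expose a dimension the user omitted
    return new_args


def str2bool(text: str) -> bool:
    """Parse "True"/"False" style strings produced by str(bool)."""
    return text.lower() in ("true", "1", "yes")


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """User side: expose every dimension the tuner may set."""
    parser = argparse.ArgumentParser()
    for name in ("dp_degree", "mp_degree", "pp_degree", "sharding_degree",
                 "sharding_stage", "micro_batch_size"):
        parser.add_argument(f"--{name}", type=int, default=1)
    parser.add_argument("--use_recompute", type=str2bool, default=False)
    parser.add_argument("--recompute_granularity", type=str, default="full")
    parser.add_argument("--lr", type=float, default=1e-4)  # an ordinary user arg
    return parser.parse_args(argv)
```

For example, `parse_args(config_to_args({"dp_degree": 2}, ["--lr", "0.1"]))` yields `dp_degree == 2` and keeps the user's learning rate. The training script would then feed the parsed degrees into its own parallel setup, for instance into `fleet.DistributedStrategy().hybrid_configs`.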

paddle-bot · 2023-06-08T06:59:30Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

XieYunshen

LGTM for set_tests_properties(test_auto_tuner PROPERTIES LABELS "RUN_TYPE=EXCLUSIVE" TIMEOUT 100)

zhiqiu

LGTM, you can refine the code according to the comments in the next PR.

zhiqiu · 2023-06-14T06:33:34Z

test/auto_parallel/test_auto_tuner.py

+
+        process = subprocess.Popen(cmd)
+        process.wait()
+        self.assertEqual(process.returncode, 0)


Should the test also check the searched config?
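One way to act on this review point, sketched under assumptions: besides the return code, validate the configs the tuner actually tried. The history format, `validate_history`, and the world-size rule (dp × mp × pp × sharding equals the device count) are hypothetical here; a real test would read whatever the tuner records.

```python
import unittest
from typing import Dict, List


def validate_history(history: List[Dict], num_devices: int, task_limit: int) -> List[str]:
    """Return the problems found in the searched configs (empty list if OK)."""
    problems = []
    if not history:
        problems.append("no config was searched")
    if len(history) > task_limit:
        problems.append(f"{len(history)} tasks exceed task_limit={task_limit}")
    seen = set()
    for cfg in history:
        world = (cfg["dp_degree"] * cfg["mp_degree"]
                 * cfg["pp_degree"] * cfg["sharding_degree"])
        if world != num_devices:
            problems.append(f"{cfg} uses {world} devices, expected {num_devices}")
        key = tuple(sorted(cfg.items()))
        if key in seen:
            problems.append(f"duplicate config {cfg}")
        seen.add(key)
    return problems


class TestSearchedConfigs(unittest.TestCase):
    def test_history(self):
        # Hypothetical history; a real test would load what the tuner recorded.
        history = [
            {"dp_degree": 8, "mp_degree": 1, "pp_degree": 1, "sharding_degree": 1},
            {"dp_degree": 2, "mp_degree": 2, "pp_degree": 2, "sharding_degree": 1},
        ]
        self.assertEqual(validate_history(history, num_devices=8, task_limit=100), [])
```

Returning a list of problems instead of asserting inside the helper makes a failing test report every bad config at once.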

zhiqiu · 2023-06-14T08:27:06Z

python/paddle/distributed/launch/main.py

+        import copy
+        import json
+        import signal
+        import sys
+        import time
+
+        from ..auto_tuner.tuner import AutoTuner
+        from ..auto_tuner.utils import gen_new_args
+        from . import controllers


Better to import these at the top of the file.

zhiqiu · 2023-06-14T08:30:31Z

python/paddle/distributed/launch/main.py

+        cur_cfg = auto_tuner.search_once()
+
+        # get max time per task run
+        max_time_per_task = tuner_cfg.get("max_time_per_task", 1800)


max_time_per_task -> max_time_in_seconds_per_task?
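A unit-bearing name also makes the enforcement point obvious. As a hedged sketch (not the PR's launcher code), a per-task limit in seconds maps directly onto the `timeout` argument of Python's `subprocess.run`, which kills the child and raises `subprocess.TimeoutExpired`. `run_task` and the constant name are hypothetical; the 1800-second default mirrors the diff.

```python
import subprocess
import sys
from typing import List, Optional, Tuple

# Mirrors the 1800-second default in the diff; the name follows the
# reviewer's suggestion and is otherwise hypothetical.
DEFAULT_MAX_TIME_IN_SECONDS_PER_TASK = 1800


def run_task(
    cmd: List[str],
    max_time_in_seconds_per_task: float = DEFAULT_MAX_TIME_IN_SECONDS_PER_TASK,
) -> Tuple[str, Optional[int]]:
    """Run one tuning task and classify it as "ok", "failed" or "timeout"."""
    try:
        completed = subprocess.run(cmd, timeout=max_time_in_seconds_per_task)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the direct child at this point.
        return "timeout", None
    status = "ok" if completed.returncode == 0 else "failed"
    return status, completed.returncode


if __name__ == "__main__":
    print(run_task([sys.executable, "-c", "print('task done')"], 60))
```

`subprocess.run` only kills the process it started. A launcher that spawns one worker per device would also need to clean up those workers, for example by signalling their process group.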

zhiqiu · 2023-06-14T08:42:21Z

python/paddle/distributed/auto_tuner/tuner.py

+
+    def __init__(self, tuner_cfg):
+        self.cur_task_id = 1
+        self.task_limit = tuner_cfg.get("task_limit", 100)


DEFAULT_MAX_TASK_LIMIT = 100 ?
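A minimal sketch of this suggestion, assuming a module-level constant and a tuner whose `search_once` stops once either the task budget or the candidate space runs out. `AutoTunerSketch` and its `candidates` parameter are hypothetical; only `cur_task_id`, `task_limit`, and the default of 100 come from the diff.

```python
from typing import Dict, Iterable, Optional

# Named module-level default instead of a bare 100 inside __init__.
DEFAULT_MAX_TASK_LIMIT = 100


class AutoTunerSketch:
    """Hypothetical stand-in for the tuner; not Paddle's AutoTuner signature."""

    def __init__(self, tuner_cfg: Dict, candidates: Iterable[Dict]):
        self.cur_task_id = 1
        self.task_limit = tuner_cfg.get("task_limit", DEFAULT_MAX_TASK_LIMIT)
        self._candidates = iter(candidates)

    def search_once(self) -> Optional[Dict]:
        """Return the next config, or None once the budget or space is exhausted."""
        if self.cur_task_id > self.task_limit:
            return None
        cfg = next(self._candidates, None)
        if cfg is not None:
            self.cur_task_id += 1
        return cfg
```

Keeping the default in one named constant lets docs, tests, and the JSON loader refer to the same value.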

add auto tuner · 2e96268

Caozhou1995 added 6 commits June 8, 2023 07:55

fix prune · 34ff765
fix sharding prune and mbs candidates · a89f044
fix cfg · a73d1db
fix launch · e487c76
fix launch · 86c6ee6
add unittest · 6d47943

Caozhou1995 force-pushed the auto_configurator branch from 90ea799 to 6d47943 June 12, 2023 08:39

fix code style · ad767cf

XieYunshen approved these changes Jun 14, 2023


zhiqiu approved these changes Jun 14, 2023


zhiqiu merged commit e12d286 into PaddlePaddle:develop Jun 14, 2023
