Oobleck is a large-model training framework with fast fault recovery, built on the concept of pipeline templates.
It is the first training framework that realizes:
Dynamic reconfiguration: Oobleck can reconfigure the distributed training configuration after failures without restarting.
Pipeline template instantiation: Oobleck pre-generates a set of pipeline templates, then combines pipelines instantiated from them to form a distributed execution plan. After a failure, the same set of templates is reused to instantiate a different combination of pipelines.
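The template idea above can be illustrated with a small, self-contained sketch. This is not Oobleck's API: `PipelineTemplate` and `instantiate` are hypothetical names, each template is reduced to the number of nodes it was planned for, and "fewest pipelines" is an arbitrary placeholder objective. The sketch only shows how one fixed set of templates can cover a different node count after a failure without regenerating the templates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineTemplate:
    """Hypothetical stand-in: a pipeline layout planned for a fixed node count."""
    num_nodes: int


def instantiate(templates, available_nodes):
    """Choose templates to instantiate so the pipelines use exactly
    `available_nodes` nodes in total.

    Returns a list with one template per instantiated pipeline, or None
    if the node count cannot be covered. Among valid combinations, the
    one with the fewest pipelines is kept (placeholder objective).
    """
    # best[n] is the chosen combination covering exactly n nodes.
    best = [None] * (available_nodes + 1)
    best[0] = []
    for n in range(1, available_nodes + 1):
        for t in templates:
            prev = n - t.num_nodes
            if prev >= 0 and best[prev] is not None:
                candidate = best[prev] + [t]
                if best[n] is None or len(candidate) < len(best[n]):
                    best[n] = candidate
    return best[available_nodes]


# Templates are generated once, before training starts.
templates = [PipelineTemplate(2), PipelineTemplate(3), PipelineTemplate(4)]

# Initial plan on 9 nodes, then a new plan after one node fails.
before_failure = instantiate(templates, 9)
after_failure = instantiate(templates, 8)

print([t.num_nodes for t in before_failure])
print([t.num_nodes for t in after_failure])
```

Both plans draw from the same `templates` list; only the combination of instantiated pipelines changes when the node count drops.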
Getting Started
Install
Use pip to install Oobleck:
pip install oobleck
Oobleck relies on cornstarch for pipeline templates and on Colossal-AI as its training backend.
Optionally, install apex, xformers, and flash-attn to boost throughput (follow the instructions in each project's README).
@inproceedings{oobleck-sosp23,
title = {Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates},
author = {Jang, Insu and Yang, Zhenning and Zhang, Zhen and Jin, Xin and Chowdhury, Mosharaf},
booktitle = {ACM SIGOPS 29th Symposium on Operating Systems Principles (SOSP '23)},
year = {2023},
}