0.19.12
8732138
Clusters
Simplified use of MPI
startup_order and stop_criteria
New run configuration properties are introduced:
- startup_order: any/master-first/workers-first specifies the order in which the master and worker jobs are started.
- stop_criteria: all-done/master-done specifies the criterion by which a multi-node run is considered finished.
These properties simplify running certain multi-node workloads. For example, MPI requires that workers are up and running when the master runs mpirun, so you'd use startup_order: workers-first. An MPI workload can be considered done when the master is done, so you'd use stop_criteria: master-done and dstack won't wait for workers to exit.
DSTACK_MPI_HOSTFILE
dstack now automatically creates an MPI hostfile and exposes the DSTACK_MPI_HOSTFILE environment variable with the hostfile path. It can be used directly as mpirun --hostfile $DSTACK_MPI_HOSTFILE.
Below is the updated NCCL tests example.
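A minimal sketch of how such a task configuration might combine the two new properties with the generated hostfile. The run name, image, GPU count, and the all_reduce_perf invocation are illustrative assumptions, not taken from these notes. DSTACK_NODE_RANK, DSTACK_GPUS_NUM, and DSTACK_GPUS_PER_NODE are dstack system environment variables that these notes do not mention, so check them against your dstack version.

```yaml
type: task
name: nccl-tests              # hypothetical run name
nodes: 2                      # assumed two-node run

# Workers must be up before the master runs mpirun
startup_order: workers-first
# Consider the run finished as soon as the master job exits
stop_criteria: master-done

image: my-registry/nccl-tests:latest   # hypothetical image with MPI and nccl-tests installed
commands:
  - |
    if [ "$DSTACK_NODE_RANK" -eq 0 ]; then
      # Master node: launch the test across all nodes via the generated hostfile
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        ./all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      # Worker nodes: stay alive so mpirun on the master can reach them
      sleep infinity
    fi

resources:
  gpu: 8                      # assumed GPUs per node
```

With stop_criteria: master-done, the workers' sleep infinity does not keep the run alive: once mpirun returns on the master, dstack treats the run as finished.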

CLI
We've also updated how the CLI displays run and job status. Previously, the CLI displayed the internal status code, which was hard to interpret. Now, the STATUS column in dstack ps and dstack apply displays a status code that makes it easy to understand why a run or job was terminated.

Examples
Distributed training
TRL
The new TRL example walks you through how to run distributed fine-tuning using TRL, Accelerate, and DeepSpeed.
Axolotl
The new Axolotl example walks you through how to run distributed fine-tuning using Axolotl with dstack.
What's changed
- [Feature] Update .gitignore logic to catch more cases by @colinjc in #2695
- [Bug] Increase upload_code client timeout by @r4victor in #2709
- [Bug] Fix missing apt-get update by @r4victor in #2710
- [Internal]: Update git hooks and package.json by @olgenn in #2706
- [Examples] Add distributed Axolotl and TRL example by @Bihan in #2703
- [Docs] Update dstack-proxy contributing guide by @jvstme in #2683
- [Feature] Implement DSTACK_MPI_HOSTFILE by @r4victor in #2718
- [Feature] Implement startup_order and stop_criteria by @r4victor in #2714
- [Bug] Fix CLI exiting while master starting by @r4victor in #2720
- [Examples] Simplify NCCL tests example by @r4victor in #2723
- [Examples] Update TRL Single Node example to uv by @Bihan in #2715
- [Bug] Fix backward compatibility when creating fleets by @jvstme in #2727
- [UX]: Make run status in UI and CLI easier to understand by @peterschmidt85 in #2716
- [Bug] Fix relative paths in dstack apply --repo by @jvstme in #2733
- [Internal]: Drop hardcoded regions from the backend template by @jvstme in #2734
- [Internal]: Update backend template to match ruff formatting by @jvstme in #2735
Full changelog: 0.19.11...0.19.12