This guide provides detailed instructions, best practices, and optimized configurations for testing Mixtral, DeepSeek, and Qwen series models using the Megatron-Core framework to achieve optimal performance and reliability.
- DeepSeek-V3 best practices in a single command
- Current configurations cover H100, B200, and long-context settings; a GB200 configuration is coming soon.
- Container Setup
- Design Docs
- Environment Setup
- Performance Benchmarking
- DeepSeek Checkpoint Conversion
- Dockerfile: `dockers/Dockerfile`
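If you need to build the container image yourself, one possible flow on an enroot/pyxis-based Slurm cluster (the `.sqsh` image format used later in this guide) is sketched below; the image tag and output path are placeholders, not part of the repository:

```bash
# Build the image from the provided Dockerfile (tag name is arbitrary).
docker build -f dockers/Dockerfile -t megatron-moe:dev .

# Convert it to a .sqsh image for Slurm/pyxis using enroot.
enroot import -o /path/to/container/image.sqsh dockerd://megatron-moe:dev
```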
Please refer to `design_docs`.
Before entering the container, you need to install `yq` to process `.yaml` configuration files. Installation steps:
1. Create a local bin directory: `mkdir -p ~/.local/bin`
2. Download the `yq` executable: `wget https://github.com/mikefarah/yq/releases/download/v4.27.5/yq_linux_amd64 -O ~/.local/bin/yq`
3. Make it executable: `chmod +x ~/.local/bin/yq`
4. Add the local bin directory to your `PATH` in `~/.bashrc`: `export PATH="$HOME/.local/bin:$PATH"`
5. Apply the changes: `source ~/.bashrc`
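Once installed, you can quickly check that `yq` is on your `PATH` and can read a YAML file (the file name and key below are only illustrative):

```bash
# Confirm the binary is found and print its version.
yq --version

# Read a single key from a YAML config (placeholder file and key).
yq '.model.seq_length' your_model_config.yaml
```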
Before running any scripts, you need to set up the following environment variables:
```bash
export WANDB_API_KEY="your_wandb_api_key_here"
export MEGATRON_PATH="/path/to/your/megatron/directory"
export MCORE_RELEASE_VERSION="0.13"
export CONTAINER_IMAGE="/path/to/container/image.sqsh"
export CLUSTER="your_cluster_name"
```
- `WANDB_API_KEY`: Your Weights & Biases API key for experiment tracking. Get your key from wandb.ai/authorize.
- `MEGATRON_PATH`: Absolute path to your Megatron-MoE installation directory. Example: `path/to/Megatron-LM`.
- `MCORE_RELEASE_VERSION`: Version of Megatron-Core to use. Currently recommended: `0.13`.
- `CONTAINER_IMAGE`: Path to the container image file (`.sqsh`). Example: `path/to/container/image.sqsh`.
- `CLUSTER`: Name of your cluster environment (e.g., `EOS`, `CW`).
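To avoid re-exporting these variables in every new shell, one option (not required by the scripts) is to keep them in a small helper file that you `source` before launching. A placeholder sketch, with a hypothetical file name and example values:

```bash
# env_moe.sh -- hypothetical helper file; all values are placeholders.
export WANDB_API_KEY="your_wandb_api_key_here"
export MEGATRON_PATH="$HOME/workspace/Megatron-LM"      # absolute path to your Megatron-MoE checkout
export MCORE_RELEASE_VERSION="0.13"
export CONTAINER_IMAGE="$HOME/images/megatron-moe.sqsh"
export CLUSTER="EOS"
```

Run `source env_moe.sh` in your shell before invoking the benchmarking scripts.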
For performance benchmarking, you can launch scripts either with `sbatch` via `sbatch_benchmarking.sh` or on an interactive node via `interactive_benchmarking.sh`.
- `MODEL`
  - This is a required environment variable that must be set in your script or command.
  - Predefined models include: `Mixtral-8x2B`, `Mixtral-8x7B`, `Mixtral-8x22B`, `DeepSeek-V2`, `DeepSeek-V2-Lite`, `DeepSeek-V3`, `DeepSeek-V3-Lite`, and `Qwen2-57B-A14B`.
  - A minimal launch command with all required variables set is shown after this list.
- `CLUSTER`, `MCORE_RELEASE_VERSION`, and `MEGATRON_PATH`
  - These required variables must be defined in your script or command for proper execution.
- `CONTAINER_IMAGE`
  - This must also be defined in your script or command; it is the path to the `.sqsh` container image described in Environment Setup.
- Using WandB for Experiment Tracking
  - To use WandB for experiment tracking, set `WANDB_API_KEY` with your key from wandb.ai/authorize. It is highly recommended to add `export WANDB_API_KEY="your_own_wandb_api_key"` to your `~/.bashrc`.
  - If you do not wish to use WandB, comment out the following lines in your model's `.yaml` configuration file: `# --wandb-project: wandb_project_name` and `# --wandb-exp-name: wandb_experiment_name`.
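Putting the required variables together, a minimal launch might look like the following sketch (cluster name and paths are placeholders; `MODEL` must be one of the predefined names above):

```bash
# Minimal sbatch launch with all required variables set inline (placeholder paths).
MODEL=Mixtral-8x7B \
CLUSTER=EOS \
MCORE_RELEASE_VERSION=0.13 \
MEGATRON_PATH=/path/to/Megatron-LM \
CONTAINER_IMAGE=/path/to/container/image.sqsh \
bash ./sbatch_benchmarking.sh
```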
All model-specific runner configurations can be adjusted through `runtime_configs/benchmarking/runtime.conf` or via the benchmarking command.
- Available Model-Specific Runner Configurations
  - Parallel Mappings: `TP`, `PP`, `EP`, `CP`, `VPP`, `PP_FIRST`, `PP_LAST`, and `LAYERS_PER_VP`
  - Batch Sizes: `MBS` and `GBS`
  - Model Architecture: `NUM_LAYERS`
  - MoE Configurations: `MOE_TOKEN_DISPATCHER`, `MOE_GROUPED_GEMM`, and `--moe-extended-ep`
  - Training Configurations: `NNODES`, `RUN_TIME`, and `PRETRAIN`. Note that specifying a shorter run time may improve your job's priority in the Slurm queue.
  - Data Configurations: `SEQ_LEN` and `DATASET`
- All available optimal configurations are listed in `runtime_configs/benchmarking/runtime.conf`.
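As a quick illustration of adjusting these runner configurations via the benchmarking command rather than `runtime.conf`, the values below are placeholders, not tuned recommendations:

```bash
# Override batch sizes, sequence length, node count, and run time on the command line.
MODEL=Mixtral-8x7B MBS=1 GBS=256 SEQ_LEN=4096 NNODES=8 RUN_TIME=00:30:00 \
bash ./sbatch_benchmarking.sh
```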
All cluster configurations can be customized either through `cluster_configs/benchmarking/your_own_cluster.conf` or via the benchmarking command. For guidance on creating your own cluster configurations, refer to the template provided in `cluster_configs/benchmarking/template.conf`.
- Required Cluster-Specific Slurm Settings: `ACCOUNT`, `PARTITION`, `RUN_NAME`, and `CONTAINER_MOUNTS`
- Required Cluster-Specific Paths: `OUTPUT_PATH`, `DATA_PATH`, `TOKENIZER_MODEL`, and `LOAD_PATH`
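For orientation, here is a sketch of what such a cluster file might contain, assuming it follows shell-style variable assignments as in `cluster_configs/benchmarking/template.conf`; all values are placeholders:

```bash
# your_own_cluster.conf -- placeholder sketch, not a tested configuration.

# Required cluster-specific Slurm settings
ACCOUNT="your_slurm_account"
PARTITION="your_partition"
RUN_NAME="moe-benchmark"
CONTAINER_MOUNTS="/lustre:/lustre,/home:/home"

# Required cluster-specific paths
OUTPUT_PATH="/path/to/output"
DATA_PATH="/path/to/tokenized/dataset"
TOKENIZER_MODEL="/path/to/tokenizer.model"
LOAD_PATH="/path/to/initial/checkpoint"
```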
- To benchmark a model from scratch with preconfigured parameters:
  ```bash
  # Example for DeepSeek-V3
  MODEL=DeepSeek-V3 bash ./sbatch_benchmarking.sh
  ```
- To train a model with custom parameters:
  ```bash
  # Example for DeepSeek-V3
  MODEL=DeepSeek-V3 TP=2 PP=8 EP=64 VPP=1 PP_FIRST=8 PP_LAST=5 RUN_TIME=00:60:00 NNODES=64 bash sbatch_benchmarking.sh --recompute-granularity selective --recompute-modules mla_up_proj layernorm
  ```
- To monitor your jobs, use `squeue -u $USER` for a one-time status check or `watch -n 1 squeue -u $USER` for continuous monitoring. For detailed logging, refer to the WandB dashboard.
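For quick debugging on a node you already have allocated, the same variables can be passed to `interactive_benchmarking.sh`; a minimal sketch, assuming it mirrors the sbatch launcher's interface:

```bash
# Example for DeepSeek-V2-Lite on a single interactive node.
MODEL=DeepSeek-V2-Lite NNODES=1 bash ./interactive_benchmarking.sh
```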
> Please try MBridge and Megatron-Bridge for better HF<->MCore conversion support.
Download the DeepSeek-V3 checkpoint from HuggingFace:
```bash
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-V3
```
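The FP8 checkpoint is very large, so before converting it can help to confirm that the LFS-tracked weight shards were actually downloaded (a quick sanity check, not part of the original workflow):

```bash
cd DeepSeek-V3
git lfs ls-files | head   # list a few of the LFS-tracked weight files
du -sh .                  # rough on-disk size of the download
```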
The downloaded checkpoint is in FP8 format. Run the following command to convert it to BF16 format using the `inference/fp8_cast_bf16.py` script:

```bash
python inference/fp8_cast_bf16.py --input-fp8-hf-path /your/input/fp8/hf/path --output-bf16-hf-path /your/output/bf16/hf/path
```
To convert the BF16 HuggingFace checkpoint to a Megatron legacy checkpoint, execute the following command:
```bash
# Example for DeepSeek-V3
MODEL=DeepSeek-V3 bash ./ckpt_convert_scripts/DeepSeek-V3/convert_deepseek_v3.sh
```
Finally, run this command to convert the legacy checkpoint into a distributed checkpoint:
```bash
MODEL=DeepSeek-V3 TP=1 PP=4 EP=64 VPP=1 PP_FIRST=16 PP_LAST=13 NNODES=32 LOAD_PATH=/path/to/legacy/checkpoint bash ./sbatch_benchmarking.sh --ckpt-convert-save /path/to/save/distributed/checkpoint --ckpt-convert-format torch_dist --no-save-optim
```
For reference, after conversion the legacy checkpoint is approximately 3.4 TB, and the distributed checkpoint is about 1.4 TB.