Production-ready training recipes for state-of-the-art MoE models — DeepSeek-V3, Qwen3, and Mixtral — built on the 🚀 Megatron-Core DEV branch.
✅ Performance-tuned configs for H100, B200, and GB200 clusters
✅ Model-specific best practices for training MoE models
✅ One-command launch with sensible defaults
✅ Dry-run mode to validate arguments before submitting jobs
Ready-to-run scripts with optimized configurations:
| Model | Hardware | Scripts |
|---|---|---|
| DeepSeek-V3 | H100, B200, GB200 | `best_practice/DeepSeekV3/` |
| Qwen3 | H100 | `best_practice/Qwen3/` |
| Mixtral | H100 | `best_practice/Mixtral/` |
See `best_practice/` for detailed guides.
Install yq for YAML processing (one-time setup):
```bash
mkdir -p ~/.local/bin && wget -qO ~/.local/bin/yq https://github.com/mikefarah/yq/releases/download/v4.27.5/yq_linux_amd64 && chmod +x ~/.local/bin/yq
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
```

Set the following environment variables before launching:

| Variable | Description | Example |
|---|---|---|
| `MEGATRON_PATH` | Path to Megatron-LM | `/path/to/Megatron-LM` |
| `CONTAINER_IMAGE` | Container image path | `/path/to/image.sqsh` |
| `CLUSTER` | Name of the cluster; used to load cluster-specific settings such as data paths | `EOS`, `CW` |
| `WANDB_API_KEY` | (Optional) WandB key | From wandb.ai/authorize |
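A minimal sketch of exporting these variables, assuming a bash shell; all paths, the cluster name, and the key are placeholders:

```bash
export MEGATRON_PATH=/path/to/Megatron-LM
export CONTAINER_IMAGE=/path/to/image.sqsh
export CLUSTER=EOS                # placeholder; match one of your cluster configs
export WANDB_API_KEY=<your-key>   # optional; obtain from wandb.ai/authorize
```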
Dockerfile: `dockers/Dockerfile` (also available: `B200.Dockerfile`, `GB200.Dockerfile`)
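A hedged example of building the image and converting it to the `.sqsh` format expected by `CONTAINER_IMAGE`; the tag `moe-recipes:h100` is illustrative, and the second step assumes a pyxis/enroot-based SLURM cluster:

```bash
# Build the H100 image from the repo root (tag name is a placeholder)
docker build -f dockers/Dockerfile -t moe-recipes:h100 .

# Convert the local docker image to a squashfs file for pyxis/enroot clusters;
# this writes moe-recipes+h100.sqsh into the current directory
enroot import dockerd://moe-recipes:h100
```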
Supported models: Mixtral-8x2B, Mixtral-8x7B, Mixtral-8x22B, DeepSeek-V2, DeepSeek-V2-Lite, DeepSeek-V3, Qwen2-57B-A14B, Qwen3-235B-A22B, Qwen3-30B-A3B, Qwen3-Next-80B-A3B
Basic launch:

```bash
MODEL=DeepSeek-V3 bash ./sbatch_benchmarking.sh
```

With custom/overridden parameters:

```bash
MODEL=DeepSeek-V3 TP=2 PP=8 EP=64 VPP=1 RUN_TIME=00:60:00 NNODES=64 \
  bash sbatch_benchmarking.sh --recompute-granularity selective --recompute-modules mla_up_proj layernorm
```

💡 Tip: use dry-run mode to preview the generated SLURM script and training command without submitting to the cluster:

```bash
DRY_RUN=1 MODEL=DeepSeek-V3 bash ./sbatch_benchmarking.sh
```

This is highly recommended for verifying configurations before submitting jobs.
Runtime configs (`runtime_configs/benchmarking/runtime.conf`):
- Parallelism: `TP`, `PP`, `EP`, `CP`, `VPP`, `PP_FIRST`, `PP_LAST`, `LAYERS_PER_VP`
- Batch sizes: `MBS`, `GBS`
- Training: `NNODES`, `RUN_TIME`, `NUM_LAYERS`, `SEQ_LEN`
- MoE: `MOE_TOKEN_DISPATCHER`, `MOE_GROUPED_GEMM`
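For illustration, a hypothetical `runtime.conf` fragment, assuming the shell-style `KEY=value` format these `.conf` files use; all values are placeholders, not tuned settings:

```bash
# Parallelism (placeholder values, not a tuned configuration)
TP=2
PP=8
EP=64
VPP=1
# Batch sizes
MBS=1
GBS=8192
# Training
NNODES=64
RUN_TIME=00:60:00
SEQ_LEN=4096
```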
Cluster configs (`cluster_configs/benchmarking/template.conf`):
- Slurm: `ACCOUNT`, `PARTITION`, `RUN_NAME`, `CONTAINER_MOUNTS`
- Paths: `OUTPUT_PATH`, `DATA_PATH`, `TOKENIZER_MODEL`, `LOAD_PATH`
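Likewise, a hypothetical `template.conf` sketch; the account, partition, mounts, and paths below are placeholders for your cluster:

```bash
# Slurm (placeholders)
ACCOUNT=my_account
PARTITION=batch
RUN_NAME=deepseek-v3-benchmark
CONTAINER_MOUNTS=/lustre:/lustre
# Paths (placeholders)
OUTPUT_PATH=/path/to/output
DATA_PATH=/path/to/data
TOKENIZER_MODEL=/path/to/tokenizer.model
LOAD_PATH=/path/to/checkpoint
```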
Monitor your submitted jobs:

```bash
watch -n 1 squeue -u $USER
```

For HF↔MCore checkpoint conversion, consider MBridge or Megatron-Bridge.
1. Download and convert to BF16:

```bash
git lfs install && git clone https://huggingface.co/deepseek-ai/DeepSeek-V3
python inference/fp8_cast_bf16.py --input-fp8-hf-path /input/fp8/path --output-bf16-hf-path /output/bf16/path
```

2. Convert to a Megatron legacy checkpoint:

```bash
MODEL=DeepSeek-V3 bash ./ckpt_convert_scripts/DeepSeek-V3/convert_deepseek_v3.sh
```

3. Convert to a distributed checkpoint:

```bash
MODEL=DeepSeek-V3 TP=1 PP=4 EP=64 VPP=1 PP_FIRST=16 PP_LAST=13 NNODES=32 LOAD_PATH=/path/to/legacy/ckpt \
  bash ./sbatch_benchmarking.sh --ckpt-convert-save /path/to/dist/ckpt --ckpt-convert-format torch_dist --no-save-optim
```

Storage: legacy checkpoint ~3.4 TB, distributed checkpoint ~1.4 TB.
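A sketch of resuming training from the converted checkpoint, assuming `LOAD_PATH` also accepts a `torch_dist` checkpoint directory; the parallelism settings mirror those used during conversion above:

```bash
# Assumption: the parallel layout must match the one used in step 3
MODEL=DeepSeek-V3 TP=1 PP=4 EP=64 VPP=1 PP_FIRST=16 PP_LAST=13 NNODES=32 \
  LOAD_PATH=/path/to/dist/ckpt bash ./sbatch_benchmarking.sh
```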
- Design Docs - Implementation details for MTP, VPP, EP overlapping, etc.