Production-ready training recipes for state-of-the-art MoE models — DeepSeek-V3, Qwen3, and Mixtral — built on the 🚀 Megatron-Core DEV branch.
✅ Performance-tuned configs for H100, B200, and GB200 clusters
✅ Model-specific best practices for training MoE models
✅ One-command launch with sensible defaults
✅ Dry-run mode to validate arguments before submitting jobs
Ready-to-run scripts with optimized configurations:
| Model | Hardware | Scripts |
|---|---|---|
| DeepSeek-V3 | H100, B200, GB200 | `best_practice/DeepSeekV3/` |
| Qwen3 | H100 | `best_practice/Qwen3/` |
| Mixtral | H100 | `best_practice/Mixtral/` |
See `best_practice/` for detailed guides.
Install yq for YAML processing (one-time setup):
```bash
mkdir -p ~/.local/bin && wget -qO ~/.local/bin/yq https://github.com/mikefarah/yq/releases/download/v4.27.5/yq_linux_amd64 && chmod +x ~/.local/bin/yq
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
```

Set the following environment variables before launching:

| Variable | Description | Example |
|---|---|---|
| `MEGATRON_PATH` | Path to Megatron-LM | `/path/to/Megatron-LM` |
| `CONTAINER_IMAGE` | Container image path | `/path/to/image.sqsh` |
| `CLUSTER` | Name of the cluster; used to load cluster-specific settings such as data paths | `EOS`, `CW` |
| `WANDB_API_KEY` | (Optional) WandB key | From wandb.ai/authorize |
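A minimal sketch of exporting these variables, assuming a bash shell; all paths, the cluster name, and the key are placeholders:

```bash
export MEGATRON_PATH=/path/to/Megatron-LM
export CONTAINER_IMAGE=/path/to/image.sqsh
export CLUSTER=EOS                # placeholder; match one of your cluster configs
export WANDB_API_KEY=<your-key>   # optional; obtain from wandb.ai/authorize
```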
Dockerfile: `dockers/Dockerfile` (also available: `B200.Dockerfile`, `GB200.Dockerfile`)
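A hedged example of building the image and converting it to the `.sqsh` format expected by `CONTAINER_IMAGE`; the tag `moe-recipes:h100` is illustrative, and the second step assumes a pyxis/enroot-based SLURM cluster:

```bash
# Build the H100 image from the repo root (tag name is a placeholder)
docker build -f dockers/Dockerfile -t moe-recipes:h100 .

# Convert the local docker image to a squashfs file for pyxis/enroot clusters;
# this writes moe-recipes+h100.sqsh into the current directory
enroot import dockerd://moe-recipes:h100
```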
Supported models: Mixtral-8x2B, Mixtral-8x7B, Mixtral-8x22B, DeepSeek-V2, DeepSeek-V2-Lite, DeepSeek-V3, Qwen2-57B-A14B, Qwen3-235B-A22B, Qwen3-30B-A3B, Qwen3-Next-80B-A3B
Basic launch:

```bash
MODEL=DeepSeek-V3 bash ./sbatch_benchmarking.sh
```

With custom/overridden parameters:

```bash
MODEL=DeepSeek-V3 TP=2 PP=8 EP=64 VPP=1 RUN_TIME=00:60:00 NNODES=64 \
  bash sbatch_benchmarking.sh --recompute-granularity selective --recompute-modules mla_up_proj layernorm
```

💡 Tip: use dry-run mode to preview the generated SLURM script and training command without submitting to the cluster:

```bash
DRY_RUN=1 MODEL=DeepSeek-V3 bash ./sbatch_benchmarking.sh
```

This is highly recommended for verifying configurations before submitting jobs.
Runtime configs (`runtime_configs/benchmarking/runtime.conf`):
- Parallelism: `TP`, `PP`, `EP`, `CP`, `VPP`, `PP_FIRST`, `PP_LAST`, `LAYERS_PER_VP`
- Batch sizes: `MBS`, `GBS`
- Training: `NNODES`, `RUN_TIME`, `NUM_LAYERS`, `SEQ_LEN`
- MoE: `MOE_TOKEN_DISPATCHER`, `MOE_GROUPED_GEMM`
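For illustration, a hypothetical `runtime.conf` fragment, assuming the shell-style `KEY=value` format these `.conf` files use; all values are placeholders, not tuned settings:

```bash
# Parallelism (placeholder values, not a tuned configuration)
TP=2
PP=8
EP=64
VPP=1
# Batch sizes
MBS=1
GBS=8192
# Training
NNODES=64
RUN_TIME=00:60:00
SEQ_LEN=4096
```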
Cluster configs (`cluster_configs/benchmarking/template.conf`):
- Slurm: `ACCOUNT`, `PARTITION`, `RUN_NAME`, `CONTAINER_MOUNTS`
- Paths: `OUTPUT_PATH`, `DATA_PATH`, `TOKENIZER_MODEL`, `LOAD_PATH`
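Likewise, a hypothetical `template.conf` sketch; the account, partition, mounts, and paths below are placeholders for your cluster:

```bash
# Slurm (placeholders)
ACCOUNT=my_account
PARTITION=batch
RUN_NAME=deepseek-v3-benchmark
CONTAINER_MOUNTS=/lustre:/lustre
# Paths (placeholders)
OUTPUT_PATH=/path/to/output
DATA_PATH=/path/to/data
TOKENIZER_MODEL=/path/to/tokenizer.model
LOAD_PATH=/path/to/checkpoint
```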
Monitor your submitted jobs:

```bash
watch -n 1 squeue -u $USER
```

For HF↔MCore checkpoint conversion, consider MBridge or Megatron-Bridge.
1. Download and convert to BF16:

```bash
git lfs install && git clone https://huggingface.co/deepseek-ai/DeepSeek-V3
python inference/fp8_cast_bf16.py --input-fp8-hf-path /input/fp8/path --output-bf16-hf-path /output/bf16/path
```

2. Convert to a Megatron legacy checkpoint:

```bash
MODEL=DeepSeek-V3 bash ./ckpt_convert_scripts/DeepSeek-V3/convert_deepseek_v3.sh
```

3. Convert to a distributed checkpoint:

```bash
MODEL=DeepSeek-V3 TP=1 PP=4 EP=64 VPP=1 PP_FIRST=16 PP_LAST=13 NNODES=32 LOAD_PATH=/path/to/legacy/ckpt \
  bash ./sbatch_benchmarking.sh --ckpt-convert-save /path/to/dist/ckpt --ckpt-convert-format torch_dist --no-save-optim
```

Storage: legacy checkpoint ~3.4 TB, distributed checkpoint ~1.4 TB.
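A sketch of resuming training from the converted checkpoint, assuming `LOAD_PATH` also accepts a `torch_dist` checkpoint directory; the parallelism settings mirror those used during conversion above:

```bash
# Assumption: the parallel layout must match the one used in step 3
MODEL=DeepSeek-V3 TP=1 PP=4 EP=64 VPP=1 PP_FIRST=16 PP_LAST=13 NNODES=32 \
  LOAD_PATH=/path/to/dist/ckpt bash ./sbatch_benchmarking.sh
```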
- Design Docs - Implementation details for MTP, VPP, EP overlapping, etc.