Release v0.3.0
Blog
Our latest blog post shares highlights and progress from recent work. Take a look!
Highlights
Improved Training Throughput and Scalability via Megatron-Core Backend
In addition to the PyTorch DTensor backend, which seamlessly supports Hugging Face models, this release adds a Megatron-Core backend ("Megatron backend") to enable large-scale dense and MoE model training. It provides efficient parallelisms (data, tensor, pipeline, context, expert, and sequence) and distributed optimizers, and it is our recommended backend for RL at large model sizes and compute scales.
To use the Megatron backend, ensure you have initialized the submodules of NeMo RL:
git submodule update --init --recursive
You can try out the Megatron backend using predefined configs:
# Example 1 GPU
uv run examples/run_grpo_math.py --config=examples/configs/grpo_math_1B_megatron.yaml
Or by enabling it from the command line:
# Example 1 GPU
uv run examples/run_sft.py policy.megatron_cfg.enabled=True
To learn more about the different backends and their configuration, visit our documentation on Training Backends.
For an FAQ on using the Megatron backend, see this section.
Context Parallelism and Sequence Packing
Users can now train on longer sequences with improved GPU utilization via Context Parallelism ("CP") and Sequence Packing, supported on both the Megatron-Core and PyTorch DTensor backends.
For the Megatron backend, both Context Parallelism and sequence packing can be enabled together:
policy:
  megatron_cfg:
    context_parallel_size: 2
  sequence_packing:
    enabled: True
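These settings can also be supplied as command-line overrides on top of a predefined config. The sketch below reuses the keys above together with the earlier example config, and assumes at least 2 GPUs so that context_parallel_size=2 is valid:
# Illustrative overrides; assumes >= 2 GPUs for context_parallel_size=2
uv run examples/run_grpo_math.py --config=examples/configs/grpo_math_1B_megatron.yaml \
    policy.megatron_cfg.context_parallel_size=2 \
    policy.sequence_packing.enabled=True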
The DTensor backend also supports CP and Sequence Packing, but they cannot yet be used together; progress on this feature is tracked in #520. There is also a known issue with CP and sequence parallelism, tracked in #659. For more information about CP and its current limitations in the DTensor backend, visit our documentation.
policy:
  dtensor_cfg:
    context_parallel_size: 2
  # CP and sequence packing cannot be used together (to enable sequence packing, set context_parallel_size=1)
  sequence_packing:
    enabled: False
We recommend sequence packing to avoid extra padding and accelerate your training run. If your model cannot use sequence packing (e.g., due to an unsupported attention kernel), we recommend using dynamic_batching instead (see config). Dynamic batching is mutually exclusive with sequence packing, so enable only one of them.
For more details on sequence packing and dynamic batching and how to use them, refer to our design documentation.
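As a minimal sketch (key placement assumed to mirror the snippets above), switching from sequence packing to dynamic batching looks like this:
policy:
  sequence_packing:
    enabled: False
  # dynamic batching and sequence packing are mutually exclusive; enable only one
  dynamic_batching:
    enabled: True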
Expanded Model Support
Qwen3 Support
Full support for the Qwen3 model family, with optimized configurations, is available on the Megatron backend.
Qwen3 dense variants and the smallest MoE variant (Qwen/Qwen3-30B-A3B) are also available on the DTensor backend. If you need full N-D parallelism and the largest scale, we recommend the Megatron backend.
DeepSeekV3 Support
DeepSeekV3 (671B) is now supported on the Megatron backend. See #591 for more details on how to launch. We are continuing to optimize performance for DeepSeekV3 and other large MoE models, and we hope to land those improvements in our next release.
Async vLLM Engine
We have added async vLLM engine (v1) support in v0.3, which enables two important features that were not possible before:
- Multi-node VLLM rollouts (for large MoEs like DSV3)
- Pipeline Parallelism
Async engine can be enabled with the following config change:
generation:
  backend: "vllm"
  vllm_cfg:
    async_engine: true
With the async vLLM engine enabled, multi-turn rollouts are now much faster, since we no longer block at each turn waiting for every element in the batch to complete.
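The setting can also be passed as a command-line override. Note that in the full training config the generation section sits under policy, as in the non-colocated example further below (a sketch):
# Illustrative override to enable the async vLLM engine
uv run examples/run_grpo_math.py policy.generation.vllm_cfg.async_engine=true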
Non-colocated Generation ("Split Placement")
NeMo RL now supports placing the training backend on a different set of GPUs than the generation backend. This is currently supported with the DTensor backend, with Megatron backend support coming soon (#613).
This feature can be useful if:
- training and generation have incompatible parallelism/world sizes
- the memory after offloading for training or generation is still not low enough
Non-colocated generation can be enabled with the following config changes:
generation:
  backend: "vllm"
  colocated:
    # true: generation shares training GPUs
    # false: generation uses dedicated resources
    enabled: false
    # only relevant when enabled is false
    resources:
      gpus_per_node: null # number of GPUs dedicated to generation when the cluster has a single node (cluster.num_nodes == 1)
      num_nodes: 1 # number of nodes dedicated to generation
An example multi-node command is:
# 5 nodes with 8 GPUs each: 4 nodes for training and 1 node for inference
uv run python examples/run_grpo_math.py \
policy.generation.colocated.enabled=false \
policy.generation.colocated.resources.num_nodes=1 \
cluster.num_nodes=5 \
cluster.gpus_per_node=8
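On a single node, dedicated generation GPUs are instead carved out of that node via resources.gpus_per_node. The sketch below assumes an illustrative 6/2 split between training and generation:
# 1 node with 8 GPUs: 6 GPUs for training and 2 for inference (illustrative split)
uv run python examples/run_grpo_math.py \
    policy.generation.colocated.enabled=false \
    policy.generation.colocated.resources.gpus_per_node=2 \
    cluster.num_nodes=1 \
    cluster.gpus_per_node=8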
Non-colocated generation is also an important prerequisite for our continued work on Async RL.
MLflow Integration for Experiment Tracking
NeMo RL now supports MLflow integration for comprehensive experiment tracking and management. This extends our suite of loggers, which already included TensorBoard and Weights & Biases (wandb).
Enable MLflow tracking in your configuration:
logger:
  mlflow_enabled: true
  mlflow:
    experiment_name: "grpo-dev"
    run_name: "grpo-dev-logger"
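The same keys can also be supplied as command-line overrides (a sketch using the values from the snippet above):
uv run examples/run_grpo_math.py \
    logger.mlflow_enabled=true \
    logger.mlflow.experiment_name="grpo-dev" \
    logger.mlflow.run_name="grpo-dev-logger"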
Performance Optimizations
Refit Optimizations
Multiple improvements to the refit process (weight updates from the training to the generation backend) have led to a severalfold speedup. For large MoE models, this has a significant effect on end-to-end step time: on DeepSeekV3, these optimizations brought refit time down from 850 seconds to 51 seconds (a 16x improvement). The improvements are particularly beneficial for extra-large models with large TP sizes in vLLM.
The core engineering team is planning to share some of the insights and optimization techniques behind this work in a blog post, so stay tuned!
vLLM CUDA Graphs
In v0.3, CUDA graphs are enabled in vLLM by default.
FSDP1 Deprecation
NeMo RL has officially removed the original FSDP1 path used for multi-GPU, multi-node training in pure PyTorch. For training in pure PyTorch without the Megatron backend, we now recommend the DTensor path, which uses FSDP2 by default and is strictly better in both functionality and performance.
For more information on the deprecation and the burn testing done before its removal, see #614.
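If you are migrating off the removed FSDP1 path, a minimal sketch of opting into the DTensor path is shown below; the enabled key under dtensor_cfg is an assumption made to parallel the megatron_cfg.enabled override shown earlier, so check your config for the exact key:
policy:
  dtensor_cfg:
    # assumed key; mirrors policy.megatron_cfg.enabled
    enabled: True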
Known Issues
- Qwen 32B needs a minimum of 32 nodes on the DTensor backend (#656).
  - We have observed increased memory pressure on Qwen 32B, requiring a much higher node count than expected.
  - This bug does not appear to affect the model on the Megatron backend.
- Qwen/Qwen3-30B-A3B needs 8 nodes to run, causing overhead due to the extra parallelism.
- Llama 70B performance on long context (>=32k) is slower than expected on the Megatron-Core backend.
- DeepSeekV3 and Qwen3-235B end-to-end performance is still a work in progress, with improvements targeted for v0.4.
- On the Megatron backend, training MoE models with EP>1 and DP>1 can hang if sequence packing is enabled (#718).
  - If you are using EP>1 and DP>1, we recommend disabling sequence packing.
- DPO does not support sequence packing or dynamic batching (#719).
Release Runs
We have provided TensorBoard logs for the release runs to give you a head start on what to expect from our recipes.
To view these TensorBoard logs easily, we've provided a Google Colab notebook that downloads and serves them.
What's Changed
- fix: update the comment about why we init in fp32 by @parthchadha in #354
- feat: add and log a very rough entropy approximation by @SahilJain314 in #342
- fix: fix issues preventing running grpo on volta by @parthchadha in #294
- docs: remove license that was erroneously copy-pasted by @terrykong in #357
- fix: recipes missing args by @terrykong in #365
- test: make dpo functional test threshold higher until flakiness resolved by @terrykong in #371
- ci: Migrate to use cuda image as base for container by @chtruong814 in #312
- fix: add missing multi-turn, container information in README by @terrykong in #369
- fix: Save last checkpoint by @yfw in #368
- ci: Add release build stage by @chtruong814 in #361
- feat: Handle Gemma3 special cases in code by @yfw in #379
- feat: Fixed metric calculation and made all grpo metrics token-level by @SahilJain314 in #373
- feat: SFT on OpenMathInstruct-2 by @yfw in #360
- feat: add aime24 validation set by @abukharin-nv in #388
- feat: add deepscaler guide by @abukharin-nv in #391
- feat: dynamic batching for training and log prob stages by @jiemingz in #274
- docs: deepscaler guide on sidebar by @terrykong in #401
- fix: check if weight update failed and error out immediately by @parthchadha in #398
- fix: use validation MBS to get ref policy logprobs when running validation by @ashors1 in #363
- docs: Update openmathinstuct sft with 1 epoch results by @yfw in #405
- fix: Make dtensor default in 8b config by @parthchadha in #407
- fix: logging validation samples was broken and not logging anything by @parthchadha in #406
- docs: fix typos in README by @jhinpan in #409
- feat: Total worker isolation (avoid imports of worker deps in base environment) by @SahilJain314 in #321
- fix: enable dashboard on local clusters by @terrykong in #411
- fix: Updated hf home cache dir back to the original one (after clean) by @SahilJain314 in #416
- docs: add instructions for how to use the ray debugger by @terrykong in #240
- ci: Update cherry-pick workflow to pass semantic PR check by @chtruong814 in #414
- fix: fix refit of FusedMoE by @yuki-666 in #351
- fix: fixes #377 to rename md files with _ to - by @parthchadha in #420
- docs: Enable Version Switcher and Analytics Tracker by @aschilling-nv in #422
- docs: Added a small pointer doc to the docker dir to prevent confusion by @SahilJain314 in #430
- fix: Fixed a ton of mypy static typing issues by @SahilJain314 in #350
- ci: Upload codecov by @chtruong814 in #423
- feat: Use a NamedSharding tensor to describe parallelism by @SahilJain314 in #417
- ci: Set codecov override_branch to main if from merge queue by @chtruong814 in #437
- feat: parametrize GPUS_PER_NODE and CPUS_PER_WORKER in ray.sub by @terrykong in #410
- feat: add a megatron extra to support megatron environments by @terrykong in #308
- feat: general fsdp2 on non-MoE models + HF TP plan by @yuki-666 in #352
- fix: only step scheduler during training by @ashors1 in #446
- feat: use async vllm engine (only used in unit tests) by @parthchadha in #418
- ci: Small fixes for automation to publish pypi package and bump version 0.3.0rc0 by @ko3n1g in #277
- fix: add missing entry dynamic_batching and setting it to False by @terrykong in #455
- feat: default to UV_CACHE_DIR from within the container by @terrykong in #427
- fix: make math environment more robust to failures by restarting by @terrykong in #457
- docs: improve ray debugging instructions by @terrykong in #459
- feat: refit speedup by @yuki-666 in #449
- ci: Update release workflow to optionally release pypi without github release by @chtruong814 in #461
- fix: math-verify intermittent failure due to timeout by @terrykong in #465
- feat: add vllm pipeline parallelism and multi node rollout by @parthchadha in #460
- fix: [FSDP2] reshard_after_forward=False for root model by @weifengpy in #464
- feat: Parallelized worker initialization by @SahilJain314 in #452
- fix: Revert "Merge remote-tracking branch 'origin/main' into sahilj/parall… by @parthchadha in #475
- feat: remove manual refit param by @yuki-666 in #292
- feat: Parallelized worker initialization by @SahilJain314 in #476
- ci: Fix GHA template ref after org move by @chtruong814 in #479
- fix: Fix incorrect merge to main by @parthchadha in #473
- feat: head node in ray.sub becomes schedulable to simplify deployment by @terrykong in #477
- fix: async rollouts; incorrect tokens generated by @parthchadha in #478
- fix: printing validation results in GRPO by @killershrimp in #470
- fix: Add missing 'add_generation_prompt' key to SFT convergence tests by @ashors1 in #484
- fix(chore): bump vllm, TE, ray, torch + more performant cuda base by @terrykong in #454
- feat: using v1 runtime for async rollouts by @parthchadha in #482
- fix: add more checks for seq len as well as a script to check max_model_len by @terrykong in #495
- fix: Changes to support ray job submit and prefetch venvs by @hemildesai in #432
- feat: token_mult_prob_error sample visualization if above a threshold by @ZhiyuLi-Nvidia in #389
- feat: support non-colocated sync vllm by @yuki-666 in #489
- feat: Moving everything to 'Policy' and lm_policy for Megatron (removing 'hf') by @SahilJain314 in #511
- feat: add context parallel. by @joyang-nv in #450
- feat: Add Megatron-LM based training by @SahilJain314 in #517
- feat: async ray monitoring now tracks system memory by @terrykong in #349
- docs: Fixing some Megatron types and small cleanup by @SahilJain314 in #526
- ci: Ensure docker container is removed during test by @chtruong814 in #530
- feat: add nsys profiling for for dtensor and vllm workers by @terrykong in #487
- feat: Enable SFT and DPO with Megatron backend by @ashors1 in #525
- fix: increase test timeout 2hr -> 3hr by @terrykong in #542
- fix: fix Ray typing to not use internal package by @pcmoritz in #537
- feat: make torch index explicit to support grace-hopper/GH200/aarch64 by @terrykong in #533
- docs: enable the mcore instructions by @terrykong in #546
- docs: release runs on front page readme by @terrykong in #550
- ci: Reduce expected mem usage for sft-llama3.1-8b-instruct-1n8g-fsdp2tp1-long by @chtruong814 in #548
- feat: Multi turn async by @parthchadha in #506
- fix: fix pytest -k test usage by @parthchadha in #556
- fix: remove reference_model_buffers in fsdp2 by @yuki-666 in #558
- fix: Add assertion if async is disabled when using pp with vllm by @parthchadha in #565
- fix: remove visualization code by @parthchadha in #566
- Allow uneven shards for multi-GPU inference in vllm backend by @KiddoZhu in #494
- feat: Log code in wandb by @yfw in #175
- feat: vllm Model diagnostic test checking long generation quality by @vegaluisjose in #516
- fix: add dynamic_batching key to SFT OpenMathInstruct config by @ashors1 in #570
- feat: support async in non-colocated by @yuki-666 in #523
- fix: correct mcore dtype + assertion on activation_func by @terrykong in #572
- fix: move core ray port from 6379 -> 54258 to reduce port collision freq by @terrykong in #574
- fix: fix overlap param gather by @ashors1 in #561
- docs: fix some typos on nsys/model-quirk pages by @terrykong in #560
- feat: Add megatron to hf converter by @ashors1 in #555
- docs: Add a note on supported backends by @ashors1 in #553
- feat: Support pass@k by @peri044 in #536
- fix: Megatron config fixes by @SahilJain314 in #576
- docs: move training backends section by @ashors1 in #580
- docs: Add missing arguments to DeepScaler evaluation by @butsugiri in #502
- fix: prevent divisible error by dropping last batch in loader by @wedu-nvidia in #583
- feat: improve worker group args/kwargs by @yuki-666 in #539
- fix: update gemma3 prefix by @ashors1 in #585
- fix: Added copyright to functest by @SahilJain314 in #584
- chore: Update github url after org transfer by @chtruong814 in #512
- feat: add OpenAI format dataset for SFT by @AtsunoriFujita in #485
- fix: load HF model only on rank 0 by @parthchadha in #544
- feat: supports evaluation of multiple-choice benchmarks by @xxman-google in #559
- fix: enable expandable segments for hopper+ by @parthchadha in #594
- feat: Enable vLLM cudagraphs by @jiemingz in #498
- docs: Update guide to include minimum compute requirement by @abukharin-nv in #505
- fix: skip HelpSteer3 unit test if downloading failed by @yuki-666 in #612
- feat: optimize get logprobs when cp enabled. by @joyang-nv in #528
- enable mcore rope fusion by @jiemingz in #608
- fix: fix non-colocated with vllm tp>1 by @yuki-666 in #601
- feat: Refit: reduce the number of IPC calls by packing weights by @guyueh1 in #589
- feat: add flash-attn==2.7.4.post1 to backend dependencies by @terrykong in #622
- fix: Fix crash for logprob error plot by @yfw in #623
- refactor: remove fsdp1 path by @yuki-666 in #614
- fix: fix a answer parsing bug in MMLU-Pro. by @xxman-google in #598
- feat: add MMMLU eval benchmark. by @xxman-google in #596
- fix: pytest_sessionfinish hook in case there is no _unit_test_data. by @ffrujeri in #628
- fix: Don't call broadcast on dtensor by @parthchadha in #627
- fix: Fix eval when using async engine by @parthchadha in #626
- feat: Megatron MoE Support by @yfw in #590
- chore: exclude ray.remote from coverage by @terrykong in #624
- feat: guide to configure custom vllm version by @terrykong in #529
- feat: Deepseek Support by @yfw in #591
- feat: decouple checkpointing from validation by @ashors1 in #575
- feat: dynamically detect --gres=gpu:8 arg to work on clusters that don't need it by @terrykong in #642
- fix: fix nccl P2P initialization error for non-colocated by @Dazz993 in #636
- fix: Mcore: Added functional grpo test and typing fixes by @SahilJain314 in #527
- feat: plumb environment variables to RayWorkerGroup by @ashors1 in #631
- feat: Qwen3 support by @ashors1 in #592
- fix: Fix megatron llama3.1-8b config by @yfw in #652
- fix: update qwen32b config by @yuki-666 in #658
- fix: Make trust_remote_code default true in checkpoint by @parthchadha in #663
- feat: add script to redact hparam paths from tensorboard logs by @terrykong in #347
- test: add a unit test that verifies that the correct keys are present in configs by @ashors1 in #587
- docs: Add GitHub icon and link to top bar by @aschilling-nv in #669
- fix: Tie weights after set_model_state_dict if required by @parthchadha in #666
- feat: optimize refit by reducing set of IPC handles sent to each device by @ZhiyuLi-Nvidia in #634
- fix: adjust temperature scaling logic based on engine version by @jubick1337 in #660
- feat: introduce megatron checkpoint dir precedence by @terrykong in #665
- feat: optimize refit by preparing refit info ahead of time by @yuki-666 in #638
- docs: update converter path in README. by @xxman-google in #672
- fix: make mcore lr scheduler configuration consistent with dtensor by @ashors1 in #681
- fix: fix mcore LR increment by @ashors1 in #685
- fix: Megatron config updates to avoid OOM by @ashors1 in #687
- fix: upgrade datasets to fix squad download by @ashors1 in #692
- fix: fix lr scheduler for config that was missed in #681 by @ashors1 in #693
- fix: Fix gemma models broken by HF update by @yfw in #676
- chore: add CP+SP (sequence parallel) assertion in DTensor worker by @yuki-666 in #689
- feat: MLFlow Integration for experiment tracking by @terrykong in #697
- fix: Fix activation checkpointing for mcore path by @yfw in #703
- feat: Enable Context Parallelism and Sequence Packing for MCore and Dtensor by @SahilJain314 in #704
- fix: SyntaxWarning: invalid escape sequence '\s' by @RayenTian in #705
- chore: Bump 0.2.1 -> 0.3.0 by @terrykong in #710
- cp: fix: unset TP and PP in sft 1 GPU config (717) into r0.3.0 by @chtruong814 in #726
- cp: docs: remove doc duplicated (721) into r0.3.0 by @chtruong814 in #733
- cp: fix: guard DPO against dynamic batching and sequence packing (730) into r0.3.0 by @chtruong814 in #735
- cp: fix: remove dynamic batching from 8B llama dtensor config (#728) by @terrykong in #736
- cp: docs: Update docs to include submodule instructions (725) into r0.3.0 by @chtruong814 in #737
- cp: docs: Added docs for sequence packing and dynamic batching (729) into r0.3.0 by @chtruong814 in #753
- cp: fix: Use the conditional temperature scaling in get_logprobs as well (714) into r0.3.0 by @chtruong814 in #752
- cp: fix: Disable sequence packing in qwen moe config to prevent hang (750) into r0.3.0 by @chtruong814 in #754
- cp: docs: fix frontpage outdated eval docs (738) into r0.3.0 by @chtruong814 in #756
New Contributors
- @jhinpan made their first contribution in #409
- @weifengpy made their first contribution in #464
- @killershrimp made their first contribution in #470
- @pcmoritz made their first contribution in #537
- @vegaluisjose made their first contribution in #516
- @peri044 made their first contribution in #536
- @butsugiri made their first contribution in #502
- @wedu-nvidia made their first contribution in #583
- @AtsunoriFujita made their first contribution in #485
- @xxman-google made their first contribution in #559
- @guyueh1 made their first contribution in #589
- @ffrujeri made their first contribution in #628
- @Dazz993 made their first contribution in #636
- @jubick1337 made their first contribution in #660
- @RayenTian made their first contribution in #705
Full Changelog: v0.2.1...v0.3.0