Release v0.3.0
Blog
Our latest blog post shares highlights and progress from recent work. Take a look!
Highlights
Improved Training Throughput and Scalability via Megatron-Core Backend
In addition to the PyTorch DTensor backend, which seamlessly supports Hugging Face models, this release adds a Megatron-Core backend ("Megatron backend") to enable large-scale dense and MoE model training. It provides efficient parallelisms (data, tensor, pipeline, context, expert, and sequence) and distributed optimizers, and it is our recommended backend for RL at large model sizes and compute scales.
To use the Megatron backend, ensure you have initialized the submodules of NeMo RL:
git submodule update --init --recursive
You can try out the Megatron backend using predefined configs:
# Example 1 GPU
uv run examples/run_grpo_math.py --config=examples/configs/grpo_math_1B_megatron.yaml
Or by enabling it from the command line:
# Example 1 GPU
uv run examples/run_sft.py policy.megatron_cfg.enabled=True
To learn more about the different backends and their configuration, visit our documentation on Training Backends.
For an FAQ on using the Megatron backend, see this section.
Context Parallelism and Sequence Packing
Users can now train on longer sequences with improved GPU utilization via Context Parallelism ("CP") and Sequence Packing, supported on both the Megatron-Core and PyTorch DTensor backends.
For the Megatron backend, both Context Parallelism and sequence packing can be enabled together:
policy:
  megatron_cfg:
    context_parallel_size: 2
  sequence_packing:
    enabled: True
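These settings can also be supplied as command-line overrides on top of a predefined config. The sketch below reuses the keys above together with the earlier example config, and assumes at least 2 GPUs so that context_parallel_size=2 is valid:
# Illustrative overrides; assumes >= 2 GPUs for context_parallel_size=2
uv run examples/run_grpo_math.py --config=examples/configs/grpo_math_1B_megatron.yaml \
    policy.megatron_cfg.context_parallel_size=2 \
    policy.sequence_packing.enabled=True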
The DTensor backend also supports CP and Sequence Packing, but they cannot yet be used together; progress on this feature is tracked in #520. There is also a known issue with CP and sequence parallelism, tracked in #659. For more information about CP and its current limitations in the DTensor backend, visit our documentation.
policy:
  dtensor_cfg:
    context_parallel_size: 2
  # CP and sequence packing cannot be used together (to enable sequence packing, set context_parallel_size=1)
  sequence_packing:
    enabled: False
We recommend sequence packing to avoid extra padding and accelerate your training run. If your model cannot use sequence packing (e.g., due to an unsupported attention kernel), we recommend using dynamic_batching instead (see config). Dynamic batching is mutually exclusive with sequence packing, so enable only one of them.
For more details on sequence packing and dynamic batching and how to use them, refer to our design documentation.
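As a minimal sketch (key placement assumed to mirror the snippets above), switching from sequence packing to dynamic batching looks like this:
policy:
  sequence_packing:
    enabled: False
  # dynamic batching and sequence packing are mutually exclusive; enable only one
  dynamic_batching:
    enabled: True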
Expanded Model Support
Qwen3 Support
Full support for the Qwen3 model family, with optimized configurations, is available on the Megatron backend.
Qwen3 dense variants and the smallest MoE variant (Qwen/Qwen3-30B-A3B) are also available on the DTensor backend. If you need full N-D parallelism and the largest scale, we recommend the Megatron backend.
DeepSeekV3 Support
DeepSeekV3 (671B) is now supported on the Megatron backend. See #591 for more details on how to launch. We are continuing to optimize performance for DeepSeekV3 and other large MoE models, and we hope to land those improvements in our next release.
Async vLLM Engine
We have added async vLLM engine (v1) support in v0.3, which enables two important features that were not possible before:
- Multi-node VLLM rollouts (for large MoEs like DSV3)
- Pipeline Parallelism
Async engine can be enabled with the following config change:
generation:
  backend: "vllm"
  vllm_cfg:
    async_engine: true
With the async vLLM engine enabled, multi-turn rollouts are now much faster, since we no longer block at each turn waiting for every element in the batch to complete.
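The setting can also be passed as a command-line override. Note that in the full training config the generation section sits under policy, as in the non-colocated example further below (a sketch):
# Illustrative override to enable the async vLLM engine
uv run examples/run_grpo_math.py policy.generation.vllm_cfg.async_engine=true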
Non-colocated Generation ("Split Placement")
NeMo RL now supports placing the training backend on a different set of GPUs than the generation backend. This is currently supported with the DTensor backend, with Megatron backend support coming soon (#613).
This feature can be useful if:
- training and generation have incompatible parallelism/world sizes
- the memory after offloading for training or generation is still not low enough
Non-colocated generation can be enabled with the following config changes:
generation:
  backend: "vllm"
  colocated:
    # true: generation shares training GPUs
    # false: generation uses dedicated resources
    enabled: false
    # only relevant when enabled is false
    resources:
      gpus_per_node: null # number of GPUs dedicated to generation when the cluster has a single node (cluster.num_nodes == 1)
      num_nodes: 1 # number of nodes dedicated to generation
An example multi-node command is:
# 5 nodes with 8 GPUs each: 4 nodes for training and 1 node for inference
uv run python examples/run_grpo_math.py \
policy.generation.colocated.enabled=false \
policy.generation.colocated.resources.num_nodes=1 \
cluster.num_nodes=5 \
cluster.gpus_per_node=8
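On a single node, dedicated generation GPUs are instead carved out of that node via resources.gpus_per_node. The sketch below assumes an illustrative 6/2 split between training and generation:
# 1 node with 8 GPUs: 6 GPUs for training and 2 for inference (illustrative split)
uv run python examples/run_grpo_math.py \
    policy.generation.colocated.enabled=false \
    policy.generation.colocated.resources.gpus_per_node=2 \
    cluster.num_nodes=1 \
    cluster.gpus_per_node=8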
Non-colocated generation is also an important prerequisite for our continued work on Async RL.
MLflow Integration for Experiment Tracking
NeMo RL now supports MLflow integration for comprehensive experiment tracking and management. This extends our suite of loggers, which already included TensorBoard and Weights & Biases (wandb).
Enable MLflow tracking in your configuration:
logger:
  mlflow_enabled: true
  mlflow:
    experiment_name: "grpo-dev"
    run_name: "grpo-dev-logger"
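The same keys can also be supplied as command-line overrides (a sketch using the values from the snippet above):
uv run examples/run_grpo_math.py \
    logger.mlflow_enabled=true \
    logger.mlflow.experiment_name="grpo-dev" \
    logger.mlflow.run_name="grpo-dev-logger"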
Performance Optimizations
Refit Optimizations
Multiple improvements to the refit process (weight updates from the training to the generation backend) have led to a severalfold speedup. For large MoE models, this has a significant effect on end-to-end step time: on DeepSeekV3, these optimizations brought refit time down from 850 seconds to 51 seconds (a 16x improvement). The improvements are particularly beneficial for extra-large models with large TP sizes in vLLM.
The core engineering team is planning to share some of the insights and optimization techniques behind this work in a blog post, so stay tuned!
vLLM CUDA Graphs
In v0.3, CUDA graphs are enabled in vLLM by default.
FSDP1 Deprecation
NeMo RL has officially removed the original FSDP1 path used for multi-GPU, multi-node training in pure PyTorch. For training in pure PyTorch without the Megatron backend, we now recommend the DTensor path, which uses FSDP2 by default and is strictly better in both functionality and performance.
For more information on the deprecation and the burn testing done before its removal, see #614.
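If you are migrating off the removed FSDP1 path, a minimal sketch of opting into the DTensor path is shown below; the enabled key under dtensor_cfg is an assumption made to parallel the megatron_cfg.enabled override shown earlier, so check your config for the exact key:
policy:
  dtensor_cfg:
    # assumed key; mirrors policy.megatron_cfg.enabled
    enabled: True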
Known Issues
- Qwen 32B needs a minimum of 32 nodes on the DTensor backend (#656).
  - We have observed increased memory pressure on Qwen 32B, requiring a much higher node count than expected.
  - This bug does not appear to affect the model on the Megatron backend.
- Qwen/Qwen3-30B-A3B needs 8 nodes to run, causing overhead due to the extra parallelism.
- Llama 70B performance on long context (>=32k) is slower than expected on the Megatron-Core backend.
- DeepSeekV3 and Qwen3-235B end-to-end performance is still a work in progress, with improvements targeted for v0.4.
- On the Megatron backend, training MoE models with EP>1 and DP>1 can hang if sequence packing is enabled (#718).
  - If you are using EP>1 and DP>1, we recommend disabling sequence packing.
- DPO does not support sequence packing or dynamic batching (#719).
Release Runs
We have provided TensorBoard logs for the release runs to give you a head start on what to expect from our recipes.
To view these TensorBoard logs easily, we've provided a Google Colab notebook that downloads and serves them.
What's Changed
- fix: update the comment about why we init in fp32 by @parthchadha in #354
- feat: add and log a very rough entropy approximation by @SahilJain314 in #342
- fix: fix issues preventing running grpo on volta by @parthchadha in #294
- docs: remove license that was erroneously copy-pasted by @terrykong in #357
- fix: recipes missing args by @terrykong in #365
- test: make dpo functional test threshold higher until flakiness resolved by @terrykong in #371
- ci: Migrate to use cuda image as base for container by @chtruong814 in #312
- fix: add missing multi-turn, container information in README by @terrykong in #369
- fix: Save last checkpoint by @yfw in #368
- ci: Add release build stage by @chtruong814 in #361
- feat: Handle Gemma3 special cases in code by @yfw in #379
- feat: Fixed metric calculation and made all grpo metrics token-level by @SahilJain314 in #373
- feat: SFT on OpenMathInstruct-2 by @yfw in #360
- feat: add aime24 validation set by @abukharin-nv in #388
- feat: add deepscaler guide by @abukharin-nv in #391
- feat: dynamic batching for training and log prob stages by @jiemingz in #274
- docs: deepscaler guide on sidebar by @terrykong in #401
- fix: check if weight update failed and error out immediately by @parthchadha in #398
- fix: use validation MBS to get ref policy logprobs when running validation by @ashors1 in #363
- docs: Update openmathinstuct sft with 1 epoch results by @yfw in #405
- fix: Make dtensor default in 8b config by @parthchadha in #407
- fix: logging validation samples was broken and not logging anything by @parthchadha in #406
- docs: fix typos in README by @jhinpan in #409
- feat: Total worker isolation (avoid imports of worker deps in base environment) by @SahilJain314 in #321
- fix: enable dashboard on local clusters by @terrykong in #411
- fix: Updated hf home cache dir back to the original one (after clean) by @SahilJain314 in #416
- docs: add instructions for how to use the ray debugger by @terrykong in #240
- ci: Update cherry-pick workflow to pass semantic PR check by @chtruong814 in #414
- fix: fix refit of FusedMoE by @yuki-666 in #351
- fix: fixes #377 to rename md files with _ to - by @parthchadha in #420
- docs: Enable Version Switcher and Analytics Tracker by @aschilling-nv in #422
- docs: Added a small pointer doc to the docker dir to prevent confusion by @SahilJain314 in #430
- fix: Fixed a ton of mypy static typing issues by @SahilJain314 in #350
- ci: Upload codecov by @chtruong814 in #423
- feat: Use a NamedSharding tensor to describe parallelism by @SahilJain314 in #417
- ci: Set codecov override_branch to main if from merge queue by @chtruong814 in #437
- feat: parametrize GPUS_PER_NODE and CPUS_PER_WORKER in ray.sub by @terrykong in #410
- feat: add a megatron extra to support megatron environments by @terrykong in #308
- feat: general fsdp2 on non-MoE models + HF TP plan by @yuki-666 in #352
- fix: only step scheduler during training by @ashors1 in #446
- feat: use async vllm engine (only used in unit tests) by @parthchadha in #418
- ci: Small fixes for automation to publish pypi package and bump version 0.3.0rc0 by @ko3n1g in #277
- fix: add missing entry dynamic_batching and setting it to False by @terrykong in #455
- feat: default to UV_CACHE_DIR from within the container by @terrykong in #427
- fix: make math environment more robust to failures by restarting by @terrykong in #457
- docs: improve ray debugging instructions by @terrykong in #459
- feat: refit speedup by @yuki-666 in #449
- ci: Update release workflow to optionally release pypi without github release by @chtruong814 in #461
- fix: math-verify intermittent failure due to timeout by @terrykong in #465
- feat: add vllm pipeline parallelism and multi node rollout by @parthchadha in #460
- fix: [FSDP2] reshard_after_forward=False for root model by @weifengpy in #464
- feat: Parallelized worker initialization by @SahilJain314 in #452
- fix: Revert "Merge remote-tracking branch 'origin/main' into sahilj/parall… by @parthchadha in #475
- feat: remove manual refit param by @yuki-666 in #292
- feat: Parallelized worker initialization by @SahilJain314 in #476
- ci: Fix GHA template ref after org move by @chtruong814 in #479
- fix: Fix incorrect merge to main by @parthchadha in #473
- feat: head node in ray.sub becomes schedulable to simplify deployment by @terrykong in #477
- fix: async rollouts; incorrect tokens generated by @parthchadha in #478
- fix: printing validation results in GRPO by @killershrimp in #470
- fix: Add missing 'add_generation_prompt' key to SFT convergence tests by @ashors1 in #484
- fix(chore): bump vllm, TE, ray, torch + more performant cuda base by @terrykong in #454
- feat: using v1 runtime for async rollouts by @parthchadha in #482
- fix: add more checks for seq len as well as a script to check max_model_len by @terrykong in #495
- fix: Changes to support ray job submit and prefetch venvs by @hemildesai in #432
- feat: token_mult_prob_error sample visualization if above a threshold by @ZhiyuLi-Nvidia in #389
- feat: support non-colocated sync vllm by @yuki-666 in #489
- feat: Moving everything to 'Policy' and lm_policy for Megatron (removing 'hf') by @SahilJain314 in #511
- feat: add context parallel. by @joyang-nv in #450
- feat: Add Megatron-LM based training by @SahilJain314 in #517
- feat: async ray monitoring now tracks system memory by @terrykong in #349
- docs: Fixing some Megatron types and small cleanup by @SahilJain314 in #526
- ci: Ensure docker container is removed during test by @chtruong814 in #530
- feat: add nsys profiling for for dtensor and vllm workers by @terrykong in #487
- feat: Enable SFT and DPO with Megatron backend by @ashors1 in #525
- fix: increase test timeout 2hr -> 3hr by @terrykong in #542
- fix: fix Ray typing to not use internal package by @pcmoritz in #537
- feat: make torch index explicit to support grace-hopper/GH200/aarch64 by @terrykong in #533
- docs: enable the mcore instructions by @terrykong in #546
- docs: release runs on front page readme by @terrykong in #550
- ci: Reduce expected mem usage for sft-llama3.1-8b-instruct-1n8g-fsdp2tp1-long by @chtruong814 in #548
- feat: Multi turn async by @parthchadha in #506
- fix: fix pytest -k test usage by @parthchadha in #556
- fix: remove reference_model_buffers in fsdp2 by @yuki-666 in #558
- fix: Add assertion if async is disabled when using pp with vllm by @parthchadha in #565
- fix: remove visualization code by @parthchadha in #566
- Allow uneven shards for multi-GPU inference in vllm backend by @KiddoZhu in #494
- feat: Log code in wandb by @yfw in #175
- feat: vllm Model diagnostic test checking long generation quality by @vegaluisjose in #516
- fix: add dynamic_batching key to SFT OpenMathInstruct config by @ashors1 in #570
- feat: support async in non-colocated by @yuki-666 in #523
- fix: correct mcore dtype + assertion on activation_func by @terrykong in #572
- fix: move core ray port from 6379 -> 54258 to reduce port collision freq by @terrykong in #574
- fix: fix overlap param gather by @ashors1 in #561
- docs: fix some typos on nsys/model-quirk pages by @terrykong in #560
- feat: Add megatron to hf converter by @ashors1 in #555
- docs: Add a note on supported backends by @ashors1 in #553
- feat: Support pass@k by @peri044 in #536
- fix: Megatron config fixes by @SahilJain314 in #576
- docs: move training backends section by @ashors1 in #580
- docs: Add missing arguments to DeepScaler evaluation by @butsugiri in #502
- fix: prevent divisible error by dropping last batch in loader by @wedu-nvidia in #583
- feat: improve worker group args/kwargs by @yuki-666 in #539
- fix: update gemma3 prefix by @ashors1 in #585
- fix: Added copyright to functest by @SahilJain314 in #584
- chore: Update github url after org transfer by @chtruong814 in #512
- feat: add OpenAI format dataset for SFT by @AtsunoriFujita in #485
- fix: load HF model only on rank 0 by @parthchadha in #544
- feat: supports evaluation of multiple-choice benchmarks by @xxman-google in #559
- fix: enable expandable segments for hopper+ by @parthchadha in #594
- feat: Enable vLLM cudagraphs by @jiemingz in #498
- docs: Update guide to include minimum compute requirement by @abukharin-nv in #505
- fix: skip HelpSteer3 unit test if downloading failed by @yuki-666 in #612
- feat: optimize get logprobs when cp enabled. by @joyang-nv in #528
- enable mcore rope fusion by @jiemingz in #608
- fix: fix non-colocated with vllm tp>1 by @yuki-666 in #601
- feat: Refit: reduce the number of IPC calls by packing weights by @guyueh1 in #589
- feat: add flash-attn==2.7.4.post1 to backend dependencies by @terrykong in #622
- fix: Fix crash for logprob error plot by @yfw in #623
- refactor: remove fsdp1 path by @yuki-666 in #614
- fix: fix a answer parsing bug in MMLU-Pro. by @xxman-google in #598
- feat: add MMMLU eval benchmark. by @xxman-google in #596
- fix: pytest_sessionfinish hook in case there is no _unit_test_data. by @ffrujeri in #628
- fix: Don't call broadcast on dtensor by @parthchadha in #627
- fix: Fix eval when using async engine by @parthchadha in #626
- feat: Megatron MoE Support by @yfw in #590
- chore: exclude ray.remote from coverage by @terrykong in #624
- feat: guide to configure custom vllm version by @terrykong in #529
- feat: Deepseek Support by @yfw in #591
- feat: decouple checkpointing from validation by @ashors1 in #575
- feat: dynamically detect --gres=gpu:8 arg to work on clusters that don't need it by @terrykong in #642
- fix: fix nccl P2P initialization error for non-colocated by @Dazz993 in #636
- fix: Mcore: Added functional grpo test and typing fixes by @SahilJain314 in #527
- feat: plumb environment variables to RayWorkerGroup by @ashors1 in #631
- feat: Qwen3 support by @ashors1 in #592
- fix: Fix megatron llama3.1-8b config by @yfw in #652
- fix: update qwen32b config by @yuki-666 in #658
- fix: Make trust_remote_code default true in checkpoint by @parthchadha in #663
- feat: add script to redact hparam paths from tensorboard logs by @terrykong in #347
- test: add a unit test that verifies that the correct keys are present in configs by @ashors1 in #587
- docs: Add GitHub icon and link to top bar by @aschilling-nv in #669
- fix: Tie weights after set_model_state_dict if required by @parthchadha in #666
- feat: optimize refit by reducing set of IPC handles sent to each device by @ZhiyuLi-Nvidia in #634
- fix: adjust temperature scaling logic based on engine version by @jubick1337 in #660
- feat: introduce megatron checkpoint dir precedence by @terrykong in #665
- feat: optimize refit by preparing refit info ahead of time by @yuki-666 in #638
- docs: update converter path in README. by @xxman-google in #672
- fix: make mcore lr scheduler configuration consistent with dtensor by @ashors1 in #681
- fix: fix mcore LR increment by @ashors1 in #685
- fix: Megatron config updates to avoid OOM by @ashors1 in #687
- fix: upgrade datasets to fix squad download by @ashors1 in #692
- fix: fix lr scheduler for config that was missed in #681 by @ashors1 in #693
- fix: Fix gemma models broken by HF update by @yfw in #676
- chore: add CP+SP (sequence parallel) assertion in DTensor worker by @yuki-666 in #689
- feat: MLFlow Integration for experiment tracking by @terrykong in #697
- fix: Fix activation checkpointing for mcore path by @yfw in #703
- feat: Enable Context Parallelism and Sequence Packing for MCore and Dtensor by @SahilJain314 in #704
- fix: SyntaxWarning: invalid escape sequence '\s' by @RayenTian in #705
- chore: Bump 0.2.1 -> 0.3.0 by @terrykong in #710
- cp: fix: unset TP and PP in sft 1 GPU config (717) into r0.3.0 by @chtruong814 in #726
- cp: docs: remove doc duplicated (721) into r0.3.0 by @chtruong814 in #733
- cp: fix: guard DPO against dynamic batching and sequence packing (730) into r0.3.0 by @chtruong814 in #735
- cp: fix: remove dynamic batching from 8B llama dtensor config (#728) by @terrykong in #736
- cp: docs: Update docs to include submodule instructions (725) into r0.3.0 by @chtruong814 in #737
- cp: docs: Added docs for sequence packing and dynamic batching (729) into r0.3.0 by @chtruong814 in #753
- cp: fix: Use the conditional temperature scaling in get_logprobs as well (714) into r0.3.0 by @chtruong814 in #752
- cp: fix: Disable sequence packing in qwen moe config to prevent hang (750) into r0.3.0 by @chtruong814 in #754
- cp: docs: fix frontpage outdated eval docs (738) into r0.3.0 by @chtruong814 in #756
New Contributors
- @jhinpan made their first contribution in #409
- @weifengpy made their first contribution in #464
- @killershrimp made their first contribution in #470
- @pcmoritz made their first contribution in #537
- @vegaluisjose made their first contribution in #516
- @peri044 made their first contribution in #536
- @butsugiri made their first contribution in #502
- @wedu-nvidia made their first contribution in #583
- @AtsunoriFujita made their first contribution in #485
- @xxman-google made their first contribution in #559
- @guyueh1 made their first contribution in #589
- @ffrujeri made their first contribution in #628
- @Dazz993 made their first contribution in #636
- @jubick1337 made their first contribution in #660
- @RayenTian made their first contribution in #705
Full Changelog: v0.2.1...v0.3.0