Tags: deepspeedai/DeepSpeed
v0.17.3
Fix: Adapt Llama injection policy for newer transformers versions (#7443)

This PR fixes an `AttributeError` that occurs during `deepspeed.init_inference` when using kernel injection (`replace_with_kernel_inject=True`) with Llama models from recent versions of `transformers`.

**The Bug:** In newer `transformers` versions (e.g., `4.53.3`), configurations like `num_heads` and `rope_theta` were moved from direct attributes of the `LlamaAttention` module into a nested `config` object. The current DeepSpeed injection policy tries to access these attributes from their old, direct location, causing the initialization to fail with an `AttributeError: 'LlamaAttention' object has no attribute 'num_heads'`.

**The Solution:** This change updates the Llama injection logic to be more robust:
1. It first tries to read attributes like `num_heads` from the new `config` object location.
2. If that fails, it falls back to the legacy direct attribute path.

---------

Signed-off-by: huanyuqu <yc37960@um.edu.mo>
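For illustration only, a minimal sketch of the config-first lookup pattern the note describes; the helper name and the exact candidate attribute names are assumptions, not DeepSpeed's actual injection code:

```python
def read_attr(module, names, default=None):
    # Hypothetical helper: try the nested `config` object first (newer
    # transformers releases), then the module itself (older releases).
    # `names` lists candidate attribute names, since they are not identical
    # across versions (e.g. `num_attention_heads` on the config vs
    # `num_heads` on the module).
    for owner in (getattr(module, "config", None), module):
        if owner is None:
            continue
        for name in names:
            if hasattr(owner, name):
                return getattr(owner, name)
    return default

# `llama_attention` is assumed to be a transformers LlamaAttention instance.
# num_heads = read_attr(llama_attention, ("num_attention_heads", "num_heads"))
# rope_theta = read_attr(llama_attention, ("rope_theta",), default=10000.0)
```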
v0.17.2
fix: engine initializes optimizer attributes at the beginning (#7410)

`destroy` accesses `self.optimizer`, but the error that triggers `destroy` (via `__del__`) can happen in `__init__`, even before the optimizer and scheduler are configured. So we need to move the initialization of `self.optimizer` to the top to avoid triggering a second exception, e.g.:

```logs
File "deepspeed/runtime/engine.py", line 453, in _configure_tensor_parallel_states
    assert self.zero_optimization_stage(
AssertionError: Currently, the compatibility between 'autotp' and 'zero_stage = 3' has not been validated
Exception ignored in: <function DeepSpeedEngine.__del__ at 0x1516c0610820>
Traceback (most recent call last):
  File "deepspeed/runtime/engine.py", line 509, in __del__
    self.destroy()
  File "deepspeed/runtime/engine.py", line 512, in destroy
    if self.optimizer is not None and hasattr(self.optimizer, 'destroy'):
  File "deepspeed/runtime/engine.py", line 621, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DeepSpeedEngine' object has no attribute 'optimizer'
```

Signed-off-by: Hollow Man <hollowman@opensuse.org>
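A minimal sketch of the defensive pattern, not the actual `DeepSpeedEngine` code: define the teardown-relevant attributes before anything that can raise, and guard the teardown path.

```python
class Engine:
    def __init__(self, config):
        # Initialize attributes used during teardown first, so __del__/destroy
        # can always run even if configuration below raises.
        self.optimizer = None
        self.lr_scheduler = None
        self._configure(config)  # may raise before an optimizer exists

    def destroy(self):
        # Safe even when __init__ failed early: the attribute exists and is None.
        if self.optimizer is not None and hasattr(self.optimizer, "destroy"):
            self.optimizer.destroy()

    def __del__(self):
        self.destroy()

    def _configure(self, config):
        raise RuntimeError("simulated failure during initialization")
```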
v0.17.1
Move pytest pinning from individual tests to requirements-dev.txt until fixed. (#7327)

pytest 8.4.0 seems to break a number of our tests; rather than pinning it in each test individually, we should just put this in the requirements file until we resolve the issue.

---------

Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
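For context, a cap of this kind in `requirements-dev.txt` would look roughly like the line below; the exact specifier used in the PR is not shown in the note, so treat it as an assumption:

```
# requirements-dev.txt (illustrative; assumes the cap excludes the breaking 8.4.0 release)
pytest<8.4.0
```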
v0.16.9
[XPU] Support XCCL on the DeepSpeed side (#7299)

XCCL will be used for the XPU device on PyTorch 2.8. With this support we can drop the torch-ccl dependency on the XPU device, while still reserving the old path so that torch-ccl can be enabled.

---------

Signed-off-by: yisheng <yi.sheng@intel.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
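A minimal sketch of what such a dual path can look like, assuming the new code simply prefers the native `xccl` backend when the installed PyTorch exposes it and otherwise falls back to the legacy oneCCL bindings; the helper name is hypothetical and this is not DeepSpeed's accelerator code:

```python
import torch.distributed as dist

def pick_xpu_backend() -> str:
    # Hypothetical selector: prefer native XCCL (PyTorch >= 2.8 on XPU),
    # otherwise keep the legacy torch-ccl (oneCCL bindings) path alive.
    if hasattr(dist, "is_xccl_available") and dist.is_xccl_available():
        return "xccl"
    try:
        import oneccl_bindings_for_pytorch  # noqa: F401  (legacy torch-ccl path)
        return "ccl"
    except ImportError:
        return "gloo"  # last-resort fallback when neither backend is present

# dist.init_process_group(backend=pick_xpu_backend(), ...)
```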
v0.16.7
Make sure it's not None before offloading contiguous_grad_buffer (#7227)

Resolves #7223

When DeepCompile is enabled in ZeRO-3, `contiguous_grad_buffer` is released, so we should check that it's not None before we continue.

https://github.com/deepspeedai/DeepSpeed/blob/227a60c0c412ddf4619401b5d8d9d1674aee17b5/deepspeed/compile/init_z3.py#L22-L25

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
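A minimal sketch of the guard being described, with hypothetical names standing in for the actual ZeRO-3 optimizer fields:

```python
def offload_contiguous_grad_buffer(optimizer):
    # Hypothetical illustration: when DeepCompile has already released the
    # buffer, the attribute is None and offloading must be skipped.
    buffer = getattr(optimizer, "contiguous_grad_buffer", None)
    if buffer is None:
        return None  # released by DeepCompile under ZeRO-3; nothing to offload
    return buffer.detach().to("cpu")
```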
v0.16.6
DeepCompile for enhanced compiler integration (#7154)

This PR introduces *DeepCompile*, a new feature that efficiently integrates compiler optimizations with other DeepSpeed features. DeepCompile utilizes torch's dynamo to capture the computation graph and modifies it to incorporate DeepSpeed's optimizations seamlessly.

Currently, DeepCompile supports ZeRO-1 and ZeRO-3, with enhancements such as proactive prefetching and selective unsharding to improve performance. (More details will be added later.)

---------

Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: zafarsadiq <zafarsadiq120@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
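This is not DeepCompile itself, but a minimal sketch of the underlying mechanism it builds on: TorchDynamo hands a custom backend the captured FX graph, which an integration can inspect or rewrite before execution.

```python
import torch

def inspecting_backend(gm: torch.fx.GraphModule, example_inputs):
    # Dynamo passes the captured computation graph here; a real integration
    # would insert/reorder ops (e.g. parameter gather/release for ZeRO)
    # before returning a callable.
    gm.graph.print_tabular()
    return gm.forward  # run the captured graph unchanged

model = torch.nn.Linear(16, 16)
compiled = torch.compile(model, backend=inspecting_backend)
compiled(torch.randn(4, 16))
```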
v0.16.5
Variable batch size and LR scheduler (#7104)

# Background and rationale

In many use cases, particularly LLMs, one is faced with inputs (sentences) of variable lengths. A common practice is to pack batches by token count (not a fixed batch size), i.e. by putting together sentences whose given metric (e.g. sequence length) adds up to a user-provided value. As an example, in [Attention is all you need](https://arxiv.org/abs/1706.03762), section 5.1:

> Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.

Dynamic batch sizes have been requested in [DeepSpeed issue 1051](#1051), [DeepSpeed issue 3455](#3455), [Pytorch Lightning issue 16914](Lightning-AI/pytorch-lightning#16914), [huggingface issue 2647](huggingface/accelerate#2647) and are already available in many libraries, e.g. [NVIDIA Triton](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#dynamic-batcher) and [Meta FairSeq](https://github.com/facebookresearch/fairseq) (implementation [here](https://github.com/facebookresearch/fairseq/blob/34973a94d09ecc12092a5ecc8afece5e536b7692/fairseq/data/fairseq_dataset.py#L104)).

The immediate use case for this is when one needs to maximize GPU utilization. Moreover, this is particularly relevant for curriculum learning, where a `BxTxE` (Batch x Time x Embedding) shaped input should ideally have high `B` and low `T` at the early curriculum steps (many short sentences packed together as a batch), and low `B` and high `T` at the late steps (few long sentences in the batch). A dynamic size `T` is already supported by DeepSpeed, e.g. in the documentation for pipeline parallelism's [reset_activation_shape()](https://deepspeed.readthedocs.io/en/stable/pipeline.html#deepspeed.runtime.pipe.engine.PipelineEngine.reset_activation_shape):

> For curriculum learning that changes the seqlen of each sample, we need to call this whenever the seqlen is going to change.

However, a dynamic `B` is not supported. A dynamic `B` requires an adequate increase/decrease of the learning rate. This technique has been applied previously, and the two most common LR scaling algorithms have been described as:

1. Linear Scaling Rule: "When the minibatch size is multiplied by k, multiply the learning rate by k", as in [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al.](https://arxiv.org/abs/1706.02677)
2. Square Root scaling: "when multiplying the batch size by k, multiply the learning rate by √k, to keep the variance in the gradient expectation constant", as in [One weird trick for parallelizing convolutional neural networks, A. Krizhevsky et al.](https://arxiv.org/abs/1404.5997)

In practice, the user picks the total token count per batch as the metric that drives batching, instead of batching by sentence count. During runtime, the variable batch size is computed and the LR is adjusted accordingly, based on the LR and batch size provided by the config.

# Illustration of dynamic batch size, sequence length and LR

Imagine we picked a limit of `30` tokens per batch and set a reference `lr=1e-3` for a `train_batch_size=2` (in the DeepSpeed config). The batching algorithm for curriculum may pack the data into batches of short sentences (left) at the early stages, and batches of long sentences (right) at later stages, e.g.:

(figure: packing of short sentences at early curriculum stages vs. long sentences at later stages)

Above, we collected samples until we filled up the batch with at most 30 tokens.
The batch sizes (number of samples) then became `10` and `4` in the left and right examples, respectively. Using the linear scaling rule, the LR for those batches becomes `5e-3` and `2e-3`.

# Pipeline parallelism

Pipeline parallelism requires the same batch size and same sequence length across all micro-batches in a batch, as the activation sizes must be fixed between gradient accumulation steps. Between batches, these may change, as long as `engine.reset_activation_shape()` is called so that the new shapes are communicated on the first gradient accumulation step in the batch. Enforcing a similar `BxTxE` between batches may lead to smaller micro-batches. As an example, below is an illustration of a 2-node, 2-gradient-accumulation-step (i.e. 4 micro-batches) batching for the same dataset, when preparing data for regular DDP (left) and for the pipeline parallelism use case (right):

(figure: micro-batch packing for regular DDP vs. pipeline parallelism)

We can see that the pipeline use case (right) has the same `BxTxE` shape across all 4 micro-batches in the same batch, and in order to respect that, it packs fewer samples in the batch compared to the standard use case (left-hand side).

# Attention Head

For an input of size `BxTxE` the attention has a shape of `TxT` for a mask of fixed size across samples of the same size, or `BxTxT` for a different mask per sample (when samples have different sizes, as in the dataset above). This 3D attention matrix can be illustrated for DDP micro-batch 1 (picture above, top-left, 4 sentences) as:

(figure: per-sample `BxTxT` attention masks for DDP micro-batch 1)

Note the memory savings: the attention head has a size of `BxTxT`, i.e. a linear memory dependency on the batch size `B` and a quadratic memory dependency on the largest sequence length `T` in the (micro-)batch. Thus, supporting a dynamic size `T` allows for an increase of `B`.

# PR overview

This PR implements dynamic batching and LR scaling. The dataloader and LR scheduler needed can be retrieved by calling `get_dataloader_and_lr_scheduler_for_variable_batch_size`. A small explanation of that function follows:

- The logic behind the algorithms for LR scaling is in `scale_lr` (see the sketch after this release note);
- The partitioning of samples into batches is done by `batch_by_seqlen`.
- For pipeline parallelism, all micro-batches in a pipeline pass are required to have the same activation shapes. This is enabled by setting the following parameters to `True`:
  - `required_microbatches_of_same_sizes`, which forces the `B` dimension to be the same across all gradient accumulation steps of all dataloaders in a batch;
  - `required_microbatches_of_same_lengths`, which forces the `T` dimension to be the same across all gradient accumulation steps. It works by calling the user-provided `sample_padding_fn(sentence, len)` that pads a given sentence to the argument length.
- `batch_by_seqlen` returns `microbatch_sample_ids` (the list of sample ids per micro-batch), `batch_sizes` (the effective batch sizes) and `batch_max_seqlens` (the longest sequence across all micro-batches in a batch).
- `dataloader_for_variable_batch_size` relies on `microbatch_sample_ids` and will iterate/collate/pad samples for every batch and return a dataloader that iterates the final (variable-size) batches;
- `lr_scheduler_for_variable_batch_size` relies on `batch_sizes` to compute the learning rate for each effective batch, taking into account the batch size and LR in the config file, and scaling the LR based on the size of each effective batch and the scaling rule mentioned above (Linear, Square root, etc.).
- Special note on the `lr_scheduler` returned, which will accept either:
  1. a user-provided `Optimizer`, in which case it will scale the learning rates (in param groups) at every batch, or
  2. a user-defined `LRScheduler`, in which case it will first get the learning rate from the scheduler and then scale it accordingly.

# Example

An example for the use case with and without pipelining is provided in [`DeepSpeedExamples/training/data_efficiency/variable_batch_size_and_lr/variable_batch_size_and_lr_example.py`](https://github.com/deepspeedai/DeepSpeedExamples/tree/master/training/data_efficiency/variable_batch_size_and_lr). The example shows an attention head with attention of variable-sized `BxTxT` per batch, followed by a fixed-size feed-forward network. These are the main blocks of a Large Language Model. The feed-forward (or linear) layer that follows the attention head requires a constant input size, equivalent to the largest sentence in the whole dataset, so the output of the attention must be padded (see `feedforward: needs to convert BxTxE to BxMxE by padding extra tokens` in the code).

# Config

The example file also annotates the relevant DeepSpeed config with comments:

```python
config = {
    "train_batch_size": 16,
    # `train_micro_batch_size_per_gpu` tells how many sequence packs of `max_tokens` each will be collated together.
    # I.e. the number of tokens per micro-batch (i.e. per GPU iteration) is `train_micro_batch_size_per_gpu`*`max_tokens`.
    "train_micro_batch_size_per_gpu": 2,
    "data_efficiency": {
        "enabled": True,
        # seed to be applied to all data efficiency modules, including dynamic batching
        "seed": 42,
        "data_sampling": {
            "num_workers": 0,  # dataloader num_workers argument
            "pin_memory": False,  # dataloader pin_memory argument
            "dynamic_batching": {
                # enables or disables dynamic batching
                "enabled": True,
                # how many tokens we need to fill a pack of sequences (that will be collated together as a sample)
                "max_tokens": 100,
                # Input/output path to read or write the length of every sequence.
                # Sequence lengths will be loaded from: {metrics_path}/seqlen/seqlen_sample_to_metric.bin and *.idx
                # If the files don't exist, they'll be computed and saved on the first run, and loaded on subsequent runs.
                "metrics_path": "./curriculum_output/",
                # As the batch size increases/decreases, which method to use to scale the LR accordingly?
                # Options: linear, sqrt (square root), or None to disable
                "lr_scaling_method": "linear",
                # how to pick sentences to be packed into samples:
                # - dataloader: in the same order as they come in with the dataloader
                # - seqlen: by sequence length (shortest to longest)
                # - random: random order using the seed in config['data_efficiency']['seed']
                "sentence_picking_order": "dataloader",  # "random" / "seqlen" / "dataloader"
                # minimum number of sequences required to reach `max_tokens`. If a sentence pack is smaller, it's discarded.
                "min_batch_size": 1,
                # maximum number of sequences required to reach `max_tokens`. If a sentence pack is larger, it's discarded.
                "max_batch_size": 10,
                # enable the output of micro-batching information about sentence packing
                "verbose": True,
            },
        },
    },
}
```

# Future work

A follow-up PR will enable dynamic batching when calling `deepspeed.initialize`. I.e. instead of this:

```python
engine, _, _, _ = deepspeed.initialize(config=config, model=model)
dataloader, lr_scheduler, _ = get_dataloader_and_lr_scheduler_for_variable_batch_size_deepspeed(...)
engine.lr_scheduler = lr_scheduler
```

we'd ideally have this:

```python
engine, _, dataloader, lr_scheduler = deepspeed.initialize(config=config, model=model)
```

where `initialize` will internally call `get_dataloader_and_lr_scheduler_for_variable_batch_size_deepspeed`.

---------

Signed-off-by: Bruno Magalhaes <bruno.magalhaes@synthesia.io>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
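As referenced in the PR overview above, a minimal sketch of the LR scaling rule, written as a hypothetical standalone helper rather than DeepSpeed's actual `scale_lr`:

```python
import math

def scale_lr(base_batch_size: int, batch_size: int, base_lr: float,
             method: str = "linear") -> float:
    # Hypothetical helper illustrating the two scaling rules described above.
    ratio = batch_size / base_batch_size
    if method == "linear":   # Goyal et al.: multiply the LR by k
        return base_lr * ratio
    if method == "sqrt":     # Krizhevsky: multiply the LR by sqrt(k)
        return base_lr * math.sqrt(ratio)
    return base_lr           # None / unknown method: leave the LR unchanged

# With the reference lr=1e-3 at train_batch_size=2 from the illustration:
print(scale_lr(2, 10, 1e-3))  # 0.005 (batch of 10 samples)
print(scale_lr(2, 4, 1e-3))   # 0.002 (batch of 4 samples)
```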
v0.16.4
Fix, bf16 optimizer remove dup loop (#7054)

Refreshing the bf16 optimizer state from a bf16 checkpoint with MoE raises `IndexError: list index out of range`.

Signed-off-by: shaomin <wukon1992@gmail.com>
Co-authored-by: shaomin <wukon1992@gmail.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>