JetEngine is a lightweight inference engine for the SDAR series (and other diffusion block decoding models). It supports both dense and MoE models as well as Tensor Parallel distributed inference, and delivers substantial acceleration over the naive implementation.
In our benchmark, we tested the 4B SDAR model with block size 4 (the basic acceleration setting) and batch size 128:
- On NVIDIA A800, JetEngine reached 3000+ tokens/second.
- On NVIDIA H200, JetEngine achieved 5000+ tokens/second using FlashAttention-2 + Triton kernels.
This demonstrates that JetEngine can unlock production-level throughput for SDAR models, making it ideal for both research-scale batch inference and real-world deployment scenarios.
For inference with LLaDA, the large logits tensor can easily exhaust GPU memory. For example, with batch size 32 and context length 1024, the logits alone occupy roughly 38 GB of GPU memory. So for pure diffusion models, consider decreasing max_num_seqs when initializing LLM and max_active when generating, as sketched below.
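A minimal sketch of those two knobs (the parameter names come from this README; the exact signatures live in config.py and sampling_params.py, so treat the call shapes here as assumptions):

import os
# LLM / SamplingParams are JetEngine classes; import path assumed from example.py.

path = os.path.expanduser("LLaDA-8B-Instruct")
# Logits memory scales roughly with (active sequences) x (context length) x (vocab size),
# so capping concurrency bounds the peak allocation.
llm = LLM(path, mask_token_id=126336, block_length=1024,
          max_num_seqs=8,  # fewer resident sequences at init time
          gpu_memory_utilization=0.9)
# Hypothetical call shape: cap concurrently denoised sequences at generation time.
# prompts and sampling_params are defined as in the examples below.
outputs = llm.generate(prompts, sampling_params, max_active=8)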
[11/04/2025] Support dLLM-Var
import os
from transformers import AutoTokenizer
# LLM and SamplingParams are JetEngine classes; see example.py for the exact imports.

path = os.path.expanduser("dLLM-Var")
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# LLaDA-family models use 126336 as the mask token.
llm = LLM(path, enforce_eager=False, tensor_parallel_size=1, mask_token_id=126336, block_length=64, gpu_memory_utilization=0.9)  # Must set mask_token_id & block_length
sampling_params = SamplingParams(temperature=1.0, topk=0, topp=1.0, max_tokens=2048,
                                 remasking_strategy="low_confidence_dynamic", block_length=64, denoising_steps=64, dynamic_threshold=0.90)

[11/03/2025] Support LLaDA
path = os.path.expanduser("LLaDA-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# LLaDA uses 126336 as the mask token.
llm = LLM(path, enforce_eager=False, tensor_parallel_size=1, mask_token_id=126336, block_length=1024, gpu_memory_utilization=0.9)  # Must set mask_token_id & block_length
sampling_params = SamplingParams(temperature=1.0, topk=0, topp=1.0, max_tokens=2048,
                                 remasking_strategy="low_confidence_dynamic", block_length=1024, denoising_steps=1024, dynamic_threshold=0.90)

As long as block_length > prompt length, this runs as pure LLaDA.
But if you try block_length < prompt length, you will see some interesting behavior.
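To actually run either example above, the snippet below is a minimal sketch. The generate call shape is an assumption modeled on vLLM-style engines; the authoritative usage is in example.py.

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain diffusion language models in one paragraph."}],
    tokenize=False,
    add_generation_prompt=True,
)
# Assumed call shape; check example.py for the real signature and output type.
outputs = llm.generate([prompt], sampling_params)
print(outputs[0])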
[10/23/2025] Further speedup
[09/15/2025] Support completely offloading the model and KV cache to free memory for RL training
[09/14/2025] Support Hybrid Data Parallel and Tensor Parallel Inference
[09/07/2025] Support Entropy Bounded sampler
SamplingParams(temperature=1.0, topk=0, topp=1.0, max_tokens=4096, remasking_strategy="entropy_bounded", block_length=4, denoising_steps=4, eb_threshold=0.6)

eb_threshold is the entropy threshold for the entropy-bounded sampler.
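The sampler's exact decision rule is in the JetEngine source; as a rough, hypothetical illustration of the idea, assuming a position is unmasked once its predictive entropy falls below eb_threshold:

import torch

def entropy_bounded_unmask(logits: torch.Tensor, eb_threshold: float = 0.6) -> torch.Tensor:
    """Illustrative sketch only (not JetEngine's implementation).
    logits: [block_length, vocab_size] for the current block.
    Returns a boolean mask of positions confident enough to unmask."""
    probs = torch.softmax(logits, dim=-1)
    # Shannon entropy per position; clamp avoids log(0).
    entropy = -(probs * probs.clamp_min(1e-10).log()).sum(dim=-1)
    return entropy < eb_threshold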
Requirements:
transformers>=4.52.4
flash-attn
For Local Inference (supports DP and TP, managed by Hugging Face accelerate):
pip install flash-attn --no-build-isolation
git clone https://github.com/Labman42/JetEngine.git
cd JetEngine
pip install .

hf download --resume-download JetLM/SDAR-1.7B-Chat \
    --repo-type model \
    --local-dir ~/huggingface/SDAR-1.7B-Chat/

By default, all GPUs are used for Data Parallel:
accelerate launch --multi_gpu example.py

To use a single GPU:
CUDA_VISIBLE_DEVICES='0' accelerate launch --multi_gpu example.py

Or you can add your own config file for accelerate, as sketched below.
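A minimal config might look like this (my_accel_config.yaml is a hypothetical name; accelerate config can also generate one interactively):

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_processes: 4
mixed_precision: bf16

Then launch with it:

accelerate launch --config_file my_accel_config.yaml example.py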
See example.py for usage. Please check config.py and sampling_params.py carefully.
For issues or inquiries:
- Yihan Bian, University of Maryland, College Park (ybian@umd.edu)

