Releases: microsoft/onnxruntime-genai
v0.10.0
6deb570
Compare
What's Changed
- Enable continuous decoding for NvTensorRtRtx EP by @anujj in #1697
- Use updated Decoder API with `skip_special_tokens` by @sayanshaw24 in #1722
- Update extensions to include memleak fix by @baijumeswani in #1724
- Support batch processing for whisper example by @jiafatom in #1723
- Update onnxruntime_extensions dependency version by @baijumeswani in #1725
- Include C++ header in native nuget and fix compiler warnings by @baijumeswani in #1727
- Update Microsoft.Extensions.AI to 9.8.0 by @rogerbarreto in #1689
- Update Extensions commit for Qwen 2.5 Chat Template Tools Fix by @sayanshaw24 in #1730
- Whisper Truncation Extensions Commit Update by @sayanshaw24 in #1735
- Enable Cuda Graph for TensorRtRtx by default by @anujj in #1734
- Update sampling benchmark by @tianleiwu in #1729
- Add Windows WinML x64 build workflow by @chrisdMSFT in #1740
- Fix CUDA synchronization issue between ORT-GenAI and TRT-RTX inference by @anujj in #1733
- Hello WindowsML by @chrisdMSFT in #1711
- [CUDA] sampling kernel improvements by @tianleiwu in #1732
- Update GitHub Actions to latest versions by @snnn in #1749
- Update WinML version to 1.8.2091 by @nieubank in #1750
- Address macos packaging pipeline issues by @baijumeswani in #1747
- ProviderOptions level device filtering and APIs to configure model level device filtering by @vortex-captain in #1744
- Fix string indexing bug with Phi-4 mm tokenization by @kunal-vaishnavi in #1751
- Fix TRT-RTX EP regression by @gaugarg-nv in #1754
- Fix typo in C API header by @kunal-vaishnavi in #1753
- Enable WinML by default in ADO pipelines by @chrisdMSFT in #1755
- Change default build configuration to 'relwithdebinfo' by @baijumeswani in #1757
- Pin cmake and vcpkg versions in macOS workflows by @snnn in #1760
- Add TRT_RTX support for onnxruntime-genai-trt-rtx wheel by @anujj in #1736
- rel-0.10.0 by @chrisdMSFT in #1767
- Microsoft.ML.OnnxRuntimeGenAI.WinML.props by @chrisdMSFT in #1776
- Warning fix - ort_genai.h by @chrisdMSFT in #1778
- Microsoft.ML.OnnxRuntimeGenAI.targets by @chrisdMSFT in #1781
New Contributors
Full Changelog: v0.9.2...v0.10.0
Assets 13
v0.9.2
Compare
This release fixes a pre-processing bug with Phi-4 multimodal.
Full Changelog: v0.9.1...v0.9.2
Assets 11
v0.9.1
41211b8
Compare
🚀 Features
Support for Continuous Batching (#1580) by @baijumeswani
RegisterExecutionProviderLibrary (#1628) by @vortex-captain
Enable CUDA graph for LLMs for NvTensorRtRtx EP (#1645) by @anujj
Add support for smollm3 (#1666) by @xenova
Add OpenAI's gpt-oss to ONNX Runtime GenAI (#1678) by @kunal-vaishnavi
Add custom ops library path resolution using EP metadata (#1707) by @psakhamoori
Use OnnxRuntime API wrapper for EP device operations (#1719) by @psakhamoori
🛠 Improvements
Update Extensions Commit to Support Strft Custom Function for Chat Template (#1670) by @sayanshaw24
Add parameters to chat template in chat example (#1673) by @kunal-vaishnavi
Update how Hugging Face's config files are processed (#1693) by @kunal-vaishnavi
Tie embedding weight sharing (#1690) by @jiafatom
Improve top-k sampling CUDA kernel (#1708) by @gaugarg-nv
🐛 Bug Fixes
Fix accessing final norm for Gemma-3 models (#1687) by @kunal-vaishnavi
Fix runtime bugs with multi-modal models (#1701) by @kunal-vaishnavi
Fix BF16 CUDA version of OpenAI's gpt-oss (#1706) by @kunal-vaishnavi
Fix benchmark_e2e (#1702) by @jiafatom
Fix benchmark_multimodal (#1714) by @jiafatom
Fix pad vs. eos token misidentification (#1694) by @aciddelgado
⚡ Performance & EP Enhancements
NvTensorRtRtx: Support num_beam > 1 (#1688) by @anujj
NvTensorRtRtx: Skip if node of Phi4 models (#1696) by @anujj
Remove QDQ and Opset Coupling for TRT RTX EP (#1692) by @xiaoyu-work
🔒 Build & CI
Enable Security Protocols in MSVC for BinSkim (#1672) by @sayanshaw24
Explicitly specify setup-java architecture in win-cpu-arm64-build.yml (#1685) by @edgchen1
Use dotnet instead of nuget in mac build (#1717) by @natke
📦 Versioning & Release
Update version to 0.10.0 (#1676) by @ajindal1
Cherrypick 0: Forgot to change versions (#1721) by @aciddelgado
Cherrypick 1... Becomes RC1 (#1726) by @aciddelgado
Cherrypick 2 (#1743) by @aciddelgado
🙌 New Contributors
@xiaoyu-work (#1692)
@psakhamoori (#1707)
✅ Full Changelog: v0.9.0...v0.9.1
Assets 11
v0.9.0
Compare
What's Changed
New Features
- Constrained decoding integration by @ajindal1 in #1381
- Update constrained decoding by @ajindal1 in #1477
- Enable TRT multi profile option through provider option by @anujj in #1493
- Add support for Machine Translation model by @apsonawane in #1482
- Overlap prompt processing KV cache update for WindowedKeyValueCache in DecoderOnlyPipelineState by @edgchen1 in #1526
- Add basic support for tracing by @edgchen1 in #1524
- Logging SetLogCallback + Debugging cleanup by @RyanUnderhill in #1471
- Support loading models from memory by @baijumeswani in #1571
- Add SLM Engine support function calling by @kinfey in #1582
- Pass the batch_size through the Overlay by @anujj in #1627
- Enable GPU based sampling for TRT-RTX by @gaugarg-nv in #1650
Model Builder Changes
- Whisper Redesigned Solution by @kunal-vaishnavi in #1229
- [Builder] Add support for Olive quantized models by @jambayk in #1647
- Add Qwen3 to model builder by @xenova in #1428
- Model builder: Add ability to exclude a node from quantization by @sushraja-msft in #1436
- Support k_quant in model builder by @jiafatom in #1444
- Add final norm for LoRA models by @kunal-vaishnavi in #1446
- Add bfloat16 support in model builder by @kunal-vaishnavi in #1447
- Fix accuracy issues with Gemma models by @kunal-vaishnavi in #1448
- Always cast bf16 logits to fp32 by @nenad1002 in #1479
- NvTensorRtRtx EP option in GenAI - model builder by @BLSharda in #1453
- Add Gemma3 Model support for NvTensorRtRtx execution provider by @anujj in #1520
- Use IRv10 in the model builder by @justinchuby in #1547
- [Builder] Rename methods make_value and make_initializer by @justinchuby in #1554
- Always use opset21 in builder by @justinchuby in #1548
- Clamp KV Cache Size to Sliding Window for NvTensorRtRtx EP by @BLSharda in #1523
- [Builder] Fix output name in make_rotary_embedding_multi_cache by @justinchuby in #1562
- [Builder] Use lazy tensor by @justinchuby in #1556
- [Builder] Fix KeyError for torch.uint8 in dtype mapping for MoE quantization by @Copilot in #1561
- [Builder] Fix 1d constant creation by @justinchuby in #1568
- [Builder] Create progress bar by @justinchuby in #1559
- [Builder] Use packed 4bit tensors directly by @justinchuby in #1566
- [Builder] Simplify constant creation by @justinchuby in #1569
- [Builder] Add cuda-bfloat16 entry to valid_gqa_configurations by @justinchuby in #1585
- [Builder] use dtype conversion helpers from onnx_ir by @justinchuby in #1587
- [Model builder] Add support for Ernie 4.5 models by @xenova in #1608
- whisper: Allow session options to be used for encoder by @RyanMetcalfeInt8 in #1622
- Make default top_k=50 in model builder by @jiafatom in #1642
- Update builder.py by @lnigam in #1665
- Change IO dtype for INT4 CUDA models by @kunal-vaishnavi in #1629
Bug fixes
- CUDA Top K / Top P Fixes by @aciddelgado in #1371
- Persist provider options across ClearProviders, AppendProvider where possible by @baijumeswani in #1454
- Add enable_skip_layer_norm_strict_mode flag by @nenad1002 in #1462
- Avoid adding providers if not requested by @baijumeswani in #1464
- Fix array eos_token_id handling by @RyanUnderhill in #1463
- Remove BF16 CPU from valid GQA configuration by @nenad1002 in #1469
- Address QNN specific regressions by @baijumeswani in #1470
- Fix how torch tensors are saved by @kunal-vaishnavi in #1476
- Fix model chat example for rewind by @ajindal1 in #1480
- Correctly iterate over the providers to check if graph capture is enabled by @baijumeswani in #1497
- Fix missing parameter name by @xadupre in #1502
- Fix from pretrained method for quantized models by @kunal-vaishnavi in #1503
- Remove position_id and fix context phase KV shapes for in-place cache buffer support by @anujj in #1505
- Fix last layer generation for text-only models by @nenad1002 in #1513
- [Fix] Remove references to TensorProto by @justinchuby in #1549
- Fix make_layernorm_casts usage of value infos by @justinchuby in #1551
- Fix DML Memory Leak by @aciddelgado in #1578
- [DML] Bind the dml global objects to the Model by @baijumeswani in #1590
- NvTensorRTRTx: Enable CUDA graph via config and fix attention_mask shape handling by @anujj in #1594
- Append eos token to the end of input sequence for marian models by @apsonawane in #1630
- Use two-step Softmax to do cuda sampling by @jiafatom in #1617
- Use two-step softmax for CPU sampling by @jiafatom in #1631
- Use last windowed input ids to update logits by @baijumeswani in #1636
- Fix attention‑mask stride bug for static masking (batch > 1) by @anujj in #1639
- Add open bytes functionality for C# by @ajindal1 in #1634
Packaging/Testing/Pipelines
- Sign macos binaries by @baijumeswani in #1439
- Add chat template tests by @sayanshaw24 in #1457
- Update triggers by @snnn in #1490
- Add support for building a cuda + dml package by @baijumeswani in #1600
- NvTensorRtRtx: Pass the dynamic shapes (ISL and batch_size) to the ep at runtime as nv profile. by @anujj in #1614
- Update docker image by @snnn in #1633
- Sign all genai dlls, in both onnxruntime-genai and python targets by @vortex-captain in #1635
- Fixes all packaging pipelines by @baijumeswani in #1641
- Update the benchmark scripts to account for the time spent in sampling by @gaugarg-nv in #1646
- Add date for nightly packages by @ajindal1 in #1668
Compliance
- Enable policheck in packaging pipeline by @apsonawane in #1449
- Add third party notices in file exclusion by @apsonawane in #1459
- Enable tsa options in packaging pipelines by @apsonawane in #1460
- Update windows packaging pipelines to use build.py by @aciddelgado in #1468
Documentation and Examples
- Update OnnxRuntimeGenAIChatClient with chat template and guidance by @stephentoub in #1533
- Update SimpleGenAI.java docs by @edgc...
Assets 11
v0.8.3
dc2d850
Compare
Assets 13
v0.8.2
fea4e96
Compare
What's changed
New features
- Use Accuracy level 4 for webgpu by default by @guschmue (#1474)
- Enable guidance by default on macos by @ajindal1 (#1514)
Bug fixes
- Remove position_id and fix context phase KV shapes for in-place cache buffer support by @anujj (#1505)
- Update Extensions Commit for 0.8.2 by @sayanshaw24 (#1519)
- Update Extensions Commit for another DeepSeek Fix by @sayanshaw24 (#1521)
Full Changelog: v0.8.1...v0.8.2
Assets 13
v0.8.1
caba648
Compare
What's changed
New features
- Integrate tools input into Chat Template API by @sayanshaw24 (#1472)
- NvTensorRtRtx EP option in GenAI - model builder by @BLSharda (#1453)
- Enable TRT multi profile option through provider option by @anujj (#1493)
Bug fixes
- Always cast bf16 logits to fp32 by @nenad1002 (#1479)
Examples and documentation
- Update Chat Template Examples for Tools API change by @sayanshaw24 (#1506)
- Fix model chat example for rewind by @ajindal1 (#1480)
Model builder changes
- Fix from pretrained method for quantized models by @kunal-vaishnavi (#1503)
- Fix missing parameter name by @xadupre (#1502)
- minor change to support qwen3 by @guschmue (#1499)
- Fix how torch tensors are saved by @kunal-vaishnavi (#1476)
- Support k_quant in model builder by @jiafatom (#1444)
Dependency updates
- Update to stable release of Microsoft.Extensions.AI.Abstractions by @stephentoub (#1489)
- Update to M.E.AI 9.4.3-preview.1.25230.7 by @stephentoub (#1443)
Full Changelog: v0.8.0...v0.8.1
Assets 13
v0.8.0
Compare
What's Changed
New Features
- Add Chat Template API Changes by @sayanshaw24 in #1398
- Add Python and C# bindings for Chat Template API by @sayanshaw24 in #1411
- Support for gemma3 model by @baijumeswani in #1374
- Support more QNN models with different model structures by @baijumeswani in #1322
- Add ability to load audio from bytes, to match images API by @RyanUnderhill in #1304
- Add support for DML Graph Capture to improve speed by @aciddelgado in #1305
- Added OnnxRuntimeGenAIChatClient ctor with Config. by @azchohfi in #1364
- Extensible AppendExecutionProvider and expose OrtSessionOptions::AddConfigEntry directly by @RyanUnderhill in #1384
- OpenVINO: Model Managed KVCache by @RyanMetcalfeInt8 in #1399
- Changes how the device OrtAllocators work, use a global OrtSession instead by @RyanUnderhill in #1378
- Remove audio attention mask processing and update ort-extensions by @baijumeswani in #1319
- Simplify the C API definitions and prevent any type mismatches going forward by @RyanUnderhill in #1365
Model builder updates
- Quark Quantizer Support by @shobrienDMA in #1207
- Add Gemma 3 to model builder by @kunal-vaishnavi in #1359
- Initial support for VitisAI EP by @AnanyaA-9 in #1370
- [OVEP] feat: Adding OpenVINO EP in ORT-GenAI by @ankitm3k in #1389
- Initial support for NV EP by @BLSharda in #1404
- Adapt to MatMulNBitsQuantizer in ort by @jiafatom in #1426
- Fix LM head for Gemma-2 by @kunal-vaishnavi in #1420
Bug Fixes
- Fix mismatch in Java bindings by @CaptainIRS in #1307
- Fix type mismatch in Java bindings by @CaptainIRS in #1313
- Update ort-extensions to fix tokenizer bug for phi4 by @baijumeswani in #1331
- Windows: Show more useful DLL load errors to say exactly what DLL is missing by @RyanUnderhill in #1345
- deprecate graph cap by @aciddelgado in #1338
- Support load/unload of models to avoid QNN errors on deepseek r1 1.5B by @baijumeswani in #1346
- Add missing 'value_stats' to logging API, and fix wrong default by @RyanUnderhill in #1353
- Convert tokens to list for concat by @ajindal1 in #1358
- Improve and Fix TopKTopP by @jiafatom in #1363
- Switch the order of softmax on CPU Top K by @aciddelgado in #1354
- Update pybind and fix rpath for macos and check for nullptr by @baijumeswani in #1367
- iterate over the providers by @baijumeswani in #1486
- Correctly iterate over the providers to check if graph capture is enabled by @baijumeswani in #1487
Examples and Documentation
- Update README.md by @RyanUnderhill in #1372
- Add slm engine example by @avijit-chakroborty in #1242
- Added cancellation to the streaming method of OnnxRuntimeGenAIChatClient. by @azchohfi in #1289
- Update nuget README with latest API by @natke in #1326
- Update C examples downloads by @ajindal1 in #1332
- Add Q&A Test Example in Nightly by @ajindal1 in #1277
- docs: update the doc of slm_engine to ensure consistency with the code by @dennis2030 in #1386
- C++ and python samples: follow_config support by @RyanMetcalfeInt8 in #1413
- Fix Do Sample example by @ajindal1 in #1337
- Make phi3 example Q&A rather than chat by @ajindal1 in #1392
- Fix broken link in package description by @rogerbarreto in #1360
Packaging and Testing
- Remove DirectML.dll dependency by @baijumeswani in #1342
- Add support to creating a custom nuget in the packaging pipeline by @baijumeswani in #1315
- Remove onnxruntime-genai-static library (non trivial change) by @RyanUnderhill in #1264
- Add macosx to custom nuget package by @baijumeswani in #1419
- Update the C++ clang-format lint workflow to use clang 20 by @snnn in #1418
- Add model_benchmark options to specify prompt to use. by @edgchen1 in #1328
- Add value_stats logging option to show statistical information about … by @RyanUnderhill in #1352
- Fixed the MacOS build and updated the test script. by @avijit-chakroborty in #1310
- Fix iOS packaging pipeline after static library removal by @RyanUnderhill in #1316
- fix bug in python benchmark script by @thevishalagarwal in #1206
- Fix macos package by @baijumeswani in #1347
- Missing *.dylib in package_data, so Mac would not package our shared libraries by @RyanUnderhill in #1341
Dependency Updates
- Update upload Artifact version by @ajindal1 in #1274
- Update to M.E.AI 9.3.0-preview.1.25161.3 by @stephentoub in #1317
- Update android min sdk version to 24 by @baijumeswani in #1324
- Update torch to 2.5.1 by @baijumeswani in #1343
- Update Pipelines for S360 by @ajindal1 in #1323
- Update Nuget pkg name by @ajindal1 in #1351
- update version to 0.8.0 by @baijumeswani in #1376
- Update custom nuget packaging logic by @baijumeswani in #1377
- Update Microsoft.Extensions.AI.Abstractions to 9.4.0-preview.1.25207.5 by @stephentoub in #1388
- Bump torch from 2.5.1 to 2.6.0 in /test/python/macos/torch by @dependabot in #1408
- Bump torch from 2.5.1+cu124 to 2.6.0+cu124 in /test/python/cuda/torch by @dependabot in #1409
- Bump torch from 2.5.1+cpu to 2.7.0 in /test/python/cpu/torch by @dependabot in #1422
- pin cmake version by @snnn in #1424
New Contributors
- @avijit-chakroborty made their first contribution in #1242
- @CaptainIRS made their first contribution in #1307
- @AnanyaA-9 made their first contribution in #1370
- @dennis2030 made their first contribution in #1386
- @ankitm3k made their first contribution in #1389
- @RyanMetcalfeInt8 made their first contribution in #1399
Full Changelog: v0.7.1...v0.8.0
Assets 13
v0.7.1
efab081
Compare
Release Notes
- Add AMD Quark Quantizer Support #1207
- Added Gemma 3 to model builder #1359
- Updated Phi-3 Python Q&A example to be consistent with C++ example #1392
- Updated Microsoft.Extensions.AI.Abstractions to 9.4.0-preview.1.25207.5 #1388
- Added OnnxRuntimeGenAIChatClient constructor with Config #1364
- Improve and Fix TopKTopP #1363
- Switch the order of softmax on CPU Top K #1354
- Updated custom nuget packaging logic #1377
- Updated pybind and fix rpath for macos and check for nullptr #1367
- Convert tokens to list for concat to accommodate breaking API change in tokenizer #1358
Assets 13
v0.7.0
8a48d7b
Compare
Release Notes
We are excited to announce the release of `onnxruntime-genai` version 0.7.0. Below are the key updates included in this release:
- Support for a wider variety of QNN NPU models (such as Deepseek R1)
- Remove the `onnxruntime-genai` static library. All language bindings now interface with `onnxruntime-genai` through the `onnxruntime-genai` shared library.
- All return types from the `onnxruntime-genai` Python package are now numpy array types. Previously the return type from `tokenizer.encode` was a Python list. This broke `examples/python/model-qa.py`, which used `+` to concatenate two lists; `np.concatenate` must be used instead in these cases.
- Abstract execution-provider-specific code into shared libraries of their own (for example, `onnxruntime-genai-cuda` for CUDA and `onnxruntime-genai-dml` for DML). This allows, for example, the `onnxruntime-genai-cuda` package to also work on non-CUDA machines.
- Support for multi-modal models (text, speech, and vision) such as Phi-4 multimodal.
- Add an `IChatClient` implementation to the `onnxruntime-genai` C# bindings.
- Expose the model type through the Python bindings.
- Code and performance improvements for DML EP.
This release also includes several bug fixes that resolve issues reported by users.
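The breaking change to Python return types noted above can be illustrated with plain numpy arrays standing in for `tokenizer.encode` output (a sketch only; the arrays below are placeholder values, not real token ids from the library):

```python
import numpy as np

# Before 0.7.0, tokenizer.encode returned a Python list, so two prompts could
# be joined with `+`. It now returns a numpy array, where `+` would perform
# elementwise addition (or fail on mismatched shapes) instead of concatenating.
system_tokens = np.array([1, 2, 3], dtype=np.int32)  # placeholder token ids
user_tokens = np.array([4, 5], dtype=np.int32)       # placeholder token ids

# Old (list) style, no longer valid:
#   input_tokens = system_tokens + user_tokens
# New style:
input_tokens = np.concatenate([system_tokens, user_tokens])
print(input_tokens.tolist())  # [1, 2, 3, 4, 5]
```

This is the same substitution that fixed `examples/python/model-qa.py` after the API change.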