Releases: microsoft/onnxruntime-genai
v0.10.0
6deb570
Compare
What's Changed
- Enable continuous decoding for NvTensorRtRtx EP by @anujj in #1697
- Use updated Decoder API with `skip_special_tokens` by @sayanshaw24 in #1722
- Update extensions to include memleak fix by @baijumeswani in #1724
- Support batch processing for whisper example by @jiafatom in #1723
- Update onnxruntime_extensions dependency version by @baijumeswani in #1725
- Include C++ header in native nuget and fix compiler warnings by @baijumeswani in #1727
- Update Microsoft.Extensions.AI to 9.8.0 by @rogerbarreto in #1689
- Update Extensions commit for Qwen 2.5 Chat Template Tools Fix by @sayanshaw24 in #1730
- Whisper Truncation Extensions Commit Update by @sayanshaw24 in #1735
- Enable Cuda Graph for TensorRtRtx by default by @anujj in #1734
- Update sampling benchmark by @tianleiwu in #1729
- Add Windows WinML x64 build workflow by @chrisdMSFT in #1740
- Fix CUDA synchronization issue between ORT-GenAI and TRT-RTX inference by @anujj in #1733
- Hello WindowsML by @chrisdMSFT in #1711
- [CUDA] sampling kernel improvements by @tianleiwu in #1732
- Update GitHub Actions to latest versions by @snnn in #1749
- Update WinML version to 1.8.2091 by @nieubank in #1750
- Address macos packaging pipeline issues by @baijumeswani in #1747
- ProviderOptions level device filtering and APIs to configure model level device filtering by @vortex-captain in #1744
- Fix string indexing bug with Phi-4 mm tokenization by @kunal-vaishnavi in #1751
- Fix TRT-RTX EP regression by @gaugarg-nv in #1754
- Fix typo in C API header by @kunal-vaishnavi in #1753
- Enable WinML by default in ADO pipelines by @chrisdMSFT in #1755
- Change default build configuration to 'relwithdebinfo' by @baijumeswani in #1757
- Pin cmake and vcpkg versions in macOS workflows by @snnn in #1760
- Add TRT_RTX support for onnxruntime-genai-trt-rtx wheel by @anujj in #1736
- rel-0.10.0 by @chrisdMSFT in #1767
- Microsoft.ML.OnnxRuntimeGenAI.WinML.props by @chrisdMSFT in #1776
- Warning fix - ort_genai.h by @chrisdMSFT in #1778
- Microsoft.ML.OnnxRuntimeGenAI.targets by @chrisdMSFT in #1781
New Contributors
Full Changelog: v0.9.2...v0.10.0
Assets 13
v0.9.2
Compare
This release fixes a pre-processing bug with Phi-4 multimodal.
Full Changelog: v0.9.1...v0.9.2
Assets 11
v0.9.1
41211b8
Compare
🚀 Features
Support for Continuous Batching (#1580) by @baijumeswani
RegisterExecutionProviderLibrary (#1628) by @vortex-captain
Enable CUDA graph for LLMs for NvTensorRtRtx EP (#1645) by @anujj
Add support for smollm3 (#1666) by @xenova
Add OpenAI's gpt-oss to ONNX Runtime GenAI (#1678) by @kunal-vaishnavi
Add custom ops library path resolution using EP metadata (#1707) by @psakhamoori
Use OnnxRuntime API wrapper for EP device operations (#1719) by @psakhamoori
🛠 Improvements
Update Extensions Commit to Support Strft Custom Function for Chat Template (#1670) by @sayanshaw24
Add parameters to chat template in chat example (#1673) by @kunal-vaishnavi
Update how Hugging Face's config files are processed (#1693) by @kunal-vaishnavi
Tie embedding weight sharing (#1690) by @jiafatom
Improve top-k sampling CUDA kernel (#1708) by @gaugarg-nv
🐛 Bug Fixes
Fix accessing final norm for Gemma-3 models (#1687) by @kunal-vaishnavi
Fix runtime bugs with multi-modal models (#1701) by @kunal-vaishnavi
Fix BF16 CUDA version of OpenAI's gpt-oss (#1706) by @kunal-vaishnavi
Fix benchmark_e2e (#1702) by @jiafatom
Fix benchmark_multimodal (#1714) by @jiafatom
Fix pad vs. eos token misidentification (#1694) by @aciddelgado
⚡ Performance & EP Enhancements
NvTensorRtRtx: Support num_beam > 1 (#1688) by @anujj
NvTensorRtRtx: Skip if node of Phi4 models (#1696) by @anujj
Remove QDQ and Opset Coupling for TRT RTX EP (#1692) by @xiaoyu-work
🔒 Build & CI
Enable Security Protocols in MSVC for BinSkim (#1672) by @sayanshaw24
Explicitly specify setup-java architecture in win-cpu-arm64-build.yml (#1685) by @edgchen1
Use dotnet instead of nuget in mac build (#1717) by @natke
📦 Versioning & Release
Update version to 0.10.0 (#1676) by @ajindal1
Cherrypick 0: Forgot to change versions (#1721) by @aciddelgado
Cherrypick 1... Becomes RC1 (#1726) by @aciddelgado
Cherrypick 2 (#1743) by @aciddelgado
🙌 New Contributors
@xiaoyu-work (#1692)
@psakhamoori (#1707)
✅ Full Changelog: v0.9.0...v0.9.1
Assets 11
v0.9.0
Compare
What's Changed
New Features
- Constrained decoding integration by @ajindal1 in #1381
- Update constrained decoding by @ajindal1 in #1477
- Enable TRT multi profile option through provider option by @anujj in #1493
- Add support for Machine Translation model by @apsonawane in #1482
- Overlap prompt processing KV cache update for WindowedKeyValueCache in DecoderOnlyPipelineState by @edgchen1 in #1526
- Add basic support for tracing by @edgchen1 in #1524
- Logging SetLogCallback + Debugging cleanup by @RyanUnderhill in #1471
- Support loading models from memory by @baijumeswani in #1571
- Add SLM Engine support function calling by @kinfey in #1582
- Pass the batch_size through the Overlay by @anujj in #1627
- Enable GPU based sampling for TRT-RTX by @gaugarg-nv in #1650
Model Builder Changes
- Whisper Redesigned Solution by @kunal-vaishnavi in #1229
- [Builder] Add support for Olive quantized models by @jambayk in #1647
- Add Qwen3 to model builder by @xenova in #1428
- Model builder: Add ability to exclude a node from quantization by @sushraja-msft in #1436
- Support k_quant in model builder by @jiafatom in #1444
- Add final norm for LoRA models by @kunal-vaishnavi in #1446
- Add bfloat16 support in model builder by @kunal-vaishnavi in #1447
- Fix accuracy issues with Gemma models by @kunal-vaishnavi in #1448
- Always cast bf16 logits to fp32 by @nenad1002 in #1479
- NvTensorRtRtx EP option in GenAI - model builder by @BLSharda in #1453
- Add Gemma3 Model support for NvTensorRtRtx execution provider by @anujj in #1520
- Use IRv10 in the model builder by @justinchuby in #1547
- [Builder] Rename methods make_value and make_initializer by @justinchuby in #1554
- Always use opset21 in builder by @justinchuby in #1548
- Clamp KV Cache Size to Sliding Window for NvTensorRtRtx EP by @BLSharda in #1523
- [Builder] Fix output name in make_rotary_embedding_multi_cache by @justinchuby in #1562
- [Builder] Use lazy tensor by @justinchuby in #1556
- [Builder] Fix KeyError for torch.uint8 in dtype mapping for MoE quantization by @Copilot in #1561
- [Builder] Fix 1d constant creation by @justinchuby in #1568
- [Builder] Create progress bar by @justinchuby in #1559
- [Builder] Use packed 4bit tensors directly by @justinchuby in #1566
- [Builder] Simplify constant creation by @justinchuby in #1569
- [Builder] Add cuda-bfloat16 entry to valid_gqa_configurations by @justinchuby in #1585
- [Builder] use dtype conversion helpers from onnx_ir by @justinchuby in #1587
- [Model builder] Add support for Ernie 4.5 models by @xenova in #1608
- whisper: Allow session options to be used for encoder by @RyanMetcalfeInt8 in #1622
- Make default top_k=50 in model builder by @jiafatom in #1642
- Update builder.py by @lnigam in #1665
- Change IO dtype for INT4 CUDA models by @kunal-vaishnavi in #1629
Bug fixes
- CUDA Top K / Top P Fixes by @aciddelgado in #1371
- Persist provider options across ClearProviders, AppendProvider where possible by @baijumeswani in #1454
- Add enable_skip_layer_norm_strict_mode flag by @nenad1002 in #1462
- Avoid adding providers if not requested by @baijumeswani in #1464
- Fix array eos_token_id handling by @RyanUnderhill in #1463
- Remove BF16 CPU from valid GQA configuration by @nenad1002 in #1469
- Address QNN specific regressions by @baijumeswani in #1470
- Fix how torch tensors are saved by @kunal-vaishnavi in #1476
- Fix model chat example for rewind by @ajindal1 in #1480
- Correctly iterate over the providers to check if graph capture is enabled by @baijumeswani in #1497
- Fix missing parameter name by @xadupre in #1502
- Fix from pretrained method for quantized models by @kunal-vaishnavi in #1503
- Remove position_id and fix context phase KV shapes for in-place cache buffer support by @anujj in #1505
- Fix last layer generation for text-only models by @nenad1002 in #1513
- [Fix] Remove references to TensorProto by @justinchuby in #1549
- Fix make_layernorm_casts usage of value infos by @justinchuby in #1551
- Fix DML Memory Leak by @aciddelgado in #1578
- [DML] Bind the dml global objects to the Model by @baijumeswani in #1590
- NvTensorRTRTx: Enable CUDA graph via config and fix attention_mask shape handling by @anujj in #1594
- Append eos token to the end of input sequence for marian models by @apsonawane in #1630
- Use two-step Softmax to do cuda sampling by @jiafatom in #1617
- Use two-step softmax for CPU sampling by @jiafatom in #1631
- Use last windowed input ids to update logits by @baijumeswani in #1636
- Fix attention‑mask stride bug for static masking (batch > 1) by @anujj in #1639
- Add open bytes functionality for C# by @ajindal1 in #1634
Packaging/Testing/Pipelines
- Sign macos binaries by @baijumeswani in #1439
- Add chat template tests by @sayanshaw24 in #1457
- Update triggers by @snnn in #1490
- Add support for building a cuda + dml package by @baijumeswani in #1600
- NvTensorRtRtx: Pass the dynamic shapes (ISL and batch_size) to the ep at runtime as nv profile. by @anujj in #1614
- Update docker image by @snnn in #1633
- Sign all genai dlls, in both onnxruntime-genai and python targets by @vortex-captain in #1635
- Fixes all packaging pipelines by @baijumeswani in #1641
- Update the benchmark scripts to account for the time spent in sampling by @gaugarg-nv in #1646
- Add date for nightly packages by @ajindal1 in #1668
Compliance
- Enable policheck in packaging pipeline by @apsonawane in #1449
- Add third party notices in file exclusion by @apsonawane in #1459
- Enable tsa options in packaging pipelines by @apsonawane in #1460
- Update windows packaging pipelines to use build.py by @aciddelgado in #1468
Documentation and Examples
- Update OnnxRuntimeGenAIChatClient with chat template and guidance by @stephentoub in #1533
- Update SimpleGenAI.java docs by @edgc...
Assets 11
v0.8.3
dc2d850
Compare
Assets 13
v0.8.2
fea4e96
Compare
What's changed
New features
- Use Accuracy level 4 for webgpu by default by @guschmue (#1474)
- Enable guidance by default on macos by @ajindal1 (#1514)
Bug fixes
- Remove position_id and fix context phase KV shapes for in-place cache buffer support by @anujj (#1505)
- Update Extensions Commit for 0.8.2 by @sayanshaw24 (#1519)
- Update Extensions Commit for another DeepSeek Fix by @sayanshaw24 (#1521)
Full Changelog: v0.8.1...v0.8.2
Assets 13
v0.8.1
caba648
Compare
What's changed
New features
- Integrate tools input into Chat Template API by @sayanshaw24 (#1472)
- NvTensorRtRtx EP option in GenAI - model builder by @BLSharda (#1453)
- Enable TRT multi profile option through provider option by @anujj (#1493)
Bug fixes
- Always cast bf16 logits to fp32 by @nenad1002 (#1479)
Examples and documentation
- Update Chat Template Examples for Tools API change by @sayanshaw24 (#1506)
- Fix model chat example for rewind by @ajindal1 (#1480)
Model builder changes
- Fix from pretrained method for quantized models by @kunal-vaishnavi (#1503)
- Fix missing parameter name by @xadupre (#1502)
- minor change to support qwen3 by @guschmue (#1499)
- Fix how torch tensors are saved by @kunal-vaishnavi (#1476)
- Support k_quant in model builder by @jiafatom (#1444)
Dependency updates
- Update to stable release of Microsoft.Extensions.AI.Abstractions by @stephentoub (#1489)
- Update to M.E.AI 9.4.3-preview.1.25230.7 by @stephentoub (#1443)
Full Changelog: v0.8.0...v0.8.1
Assets 13
v0.8.0
Compare
What's Changed
New Features
- Add Chat Template API Changes by @sayanshaw24 in #1398
- Add Python and C# bindings for Chat Template API by @sayanshaw24 in #1411
- Support for gemma3 model by @baijumeswani in #1374
- Support more QNN models with different model structures by @baijumeswani in #1322
- Add ability to load audio from bytes, to match images API by @RyanUnderhill in #1304
- Add support for DML Graph Capture to improve speed by @aciddelgado in #1305
- Added OnnxRuntimeGenAIChatClient ctor with Config. by @azchohfi in #1364
- Extensible AppendExecutionProvider and expose OrtSessionOptions::AddConfigEntry directly by @RyanUnderhill in #1384
- OpenVINO: Model Managed KVCache by @RyanMetcalfeInt8 in #1399
- Changes how the device OrtAllocators work, use a global OrtSession instead by @RyanUnderhill in #1378
- Remove audio attention mask processing and update ort-extensions by @baijumeswani in #1319
- Simplify the C API definitions and prevent any type mismatches going forward by @RyanUnderhill in #1365
Model builder updates
- Quark Quantizer Support by @shobrienDMA in #1207
- Add Gemma 3 to model builder by @kunal-vaishnavi in #1359
- Initial support for VitisAI EP by @AnanyaA-9 in #1370
- [OVEP] feat: Adding OpenVINO EP in ORT-GenAI by @ankitm3k in #1389
- Initial support for NV EP by @BLSharda in #1404
- Adapt to MatMulNBitsQuantizer in ort by @jiafatom in #1426
- Fix LM head for Gemma-2 by @kunal-vaishnavi in #1420
Bug Fixes
- Fix mismatch in Java bindings by @CaptainIRS in #1307
- Fix type mismatch in Java bindings by @CaptainIRS in #1313
- Update ort-extensions to fix tokenizer bug for phi4 by @baijumeswani in #1331
- Windows: Show more useful DLL load errors to say exactly what DLL is missing by @RyanUnderhill in #1345
- deprecate graph cap by @aciddelgado in #1338
- Support load/unload of models to avoid QNN errors on deepseek r1 1.5B by @baijumeswani in #1346
- Add missing 'value_stats' to logging API, and fix wrong default by @RyanUnderhill in #1353
- Convert tokens to list for concat by @ajindal1 in #1358
- Improve and Fix TopKTopP by @jiafatom in #1363
- Switch the order of softmax on CPU Top K by @aciddelgado in #1354
- Update pybind and fix rpath for macos and check for nullptr by @baijumeswani in #1367
- iterate over the providers by @baijumeswani in #1486
- Correctly iterate over the providers to check if graph capture is enabled by @baijumeswani in #1487
Examples and Documentation
- Update README.md by @RyanUnderhill in #1372
- Add slm engine example by @avijit-chakroborty in #1242
- Added cancellation to the streaming method of OnnxRuntimeGenAIChatClient. by @azchohfi in #1289
- Update nuget README with latest API by @natke in #1326
- Update C examples downloads by @ajindal1 in #1332
- Add Q&A Test Example in Nightly by @ajindal1 in #1277
- docs: update the doc of slm_engine to ensure consistency with the code by @dennis2030 in #1386
- C++ and python samples: follow_config support by @RyanMetcalfeInt8 in #1413
- Fix Do Sample example by @ajindal1 in #1337
- Make phi3 example Q&A rather than chat by @ajindal1 in #1392
- Fix broken link in package description by @rogerbarreto in #1360
Packaging and Testing
- Remove DirectML.dll dependency by @baijumeswani in #1342
- Add support to creating a custom nuget in the packaging pipeline by @baijumeswani in #1315
- Remove onnxruntime-genai-static library (non trivial change) by @RyanUnderhill in #1264
- Add macosx to custom nuget package by @baijumeswani in #1419
- Update the C++ clang-format lint workflow to use clang 20 by @snnn in #1418
- Add model_benchmark options to specify prompt to use. by @edgchen1 in #1328
- Add value_stats logging option to show statistical information about … by @RyanUnderhill in #1352
- Fixed the MacOS build and updated the test script. by @avijit-chakroborty in #1310
- Fix iOS packaging pipeline after static library removal by @RyanUnderhill in #1316
- fix bug in python benchmark script by @thevishalagarwal in #1206
- Fix macos package by @baijumeswani in #1347
- Missing *.dylib in package_data, so Mac would not package our shared libraries by @RyanUnderhill in #1341
Dependency Updates
- Update upload Artifact version by @ajindal1 in #1274
- Update to M.E.AI 9.3.0-preview.1.25161.3 by @stephentoub in #1317
- Update android min sdk version to 24 by @baijumeswani in #1324
- Update torch to 2.5.1 by @baijumeswani in #1343
- Update Pipelines for S360 by @ajindal1 in #1323
- Update Nuget pkg name by @ajindal1 in #1351
- update version to 0.8.0 by @baijumeswani in #1376
- Update custom nuget packaging logic by @baijumeswani in #1377
- Update Microsoft.Extensions.AI.Abstractions to 9.4.0-preview.1.25207.5 by @stephentoub in #1388
- Bump torch from 2.5.1 to 2.6.0 in /test/python/macos/torch by @dependabot in #1408
- Bump torch from 2.5.1+cu124 to 2.6.0+cu124 in /test/python/cuda/torch by @dependabot in #1409
- Bump torch from 2.5.1+cpu to 2.7.0 in /test/python/cpu/torch by @dependabot in #1422
- pin cmake version by @snnn in #1424
New Contributors
- @avijit-chakroborty made their first contribution in #1242
- @CaptainIRS made their first contribution in #1307
- @AnanyaA-9 made their first contribution in #1370
- @dennis2030 made their first contribution in #1386
- @ankitm3k made their first contribution in #1389
- @RyanMetcalfeInt8 made their first contribution in #1399
Full Changelog: v0.7.1...v0.8.0
Assets 13
v0.7.1
efab081
Compare
Release Notes
- Add AMD Quark Quantizer Support #1207
- Added Gemma 3 to model builder #1359
- Updated Phi-3 Python Q&A example to be consistent with C++ example #1392
- Updated Microsoft.Extensions.AI.Abstractions to 9.4.0-preview.1.25207.5 #1388
- Added OnnxRuntimeGenAIChatClient constructor with Config #1364
- Improve and Fix TopKTopP #1363
- Switch the order of softmax on CPU Top K #1354
- Updated custom nuget packaging logic #1377
- Updated pybind and fix rpath for macos and check for nullptr #1367
- Convert tokens to list for concat to accommodate breaking API change in tokenizer #1358
Assets 13
v0.7.0
8a48d7b
Compare
Release Notes
We are excited to announce the release of `onnxruntime-genai` version 0.7.0. Below are the key updates included in this release:
- Support for a wider variety of QNN NPU models (such as Deepseek R1)
- Remove the `onnxruntime-genai` static library. All language bindings now interface with `onnxruntime-genai` through the `onnxruntime-genai` shared library.
- All return types from the `onnxruntime-genai` Python package are now numpy array types. Previously the return type from `tokenizer.encode` was a Python list. This broke `examples/python/model-qa.py`, which used `+` to concatenate two lists; `np.concatenate` must be used instead in these cases.
- Abstract execution-provider-specific code into shared libraries of their own (for example, `onnxruntime-genai-cuda` for CUDA and `onnxruntime-genai-dml` for DML). This allows, for example, the `onnxruntime-genai-cuda` package to also work on non-CUDA machines.
- Support for multi-modal models (text, speech, and vision) such as Phi-4 multimodal.
- Add an `IChatClient` implementation to the `onnxruntime-genai` C# bindings.
- Expose the model type through the Python bindings.
- Code and performance improvements for DML EP.
This release also includes several bug fixes that resolve issues reported by users.
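The breaking change to Python return types noted above can be illustrated with plain numpy arrays standing in for `tokenizer.encode` output (a sketch only; the arrays below are placeholder values, not real token ids from the library):

```python
import numpy as np

# Before 0.7.0, tokenizer.encode returned a Python list, so two prompts could
# be joined with `+`. It now returns a numpy array, where `+` would perform
# elementwise addition (or fail on mismatched shapes) instead of concatenating.
system_tokens = np.array([1, 2, 3], dtype=np.int32)  # placeholder token ids
user_tokens = np.array([4, 5], dtype=np.int32)       # placeholder token ids

# Old (list) style, no longer valid:
#   input_tokens = system_tokens + user_tokens
# New style:
input_tokens = np.concatenate([system_tokens, user_tokens])
print(input_tokens.tolist())  # [1, 2, 3, 4, 5]
```

This is the same substitution that fixed `examples/python/model-qa.py` after the API change.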