
v1.1.0rc1

Pre-release
@Superjomn Superjomn released this 22 Aug 10:02
· 152 commits to main since this release
7334f93

Announcement Highlights:

  • Model Support

    • Add Tencent HunYuanMoEV1 model support (#5521)
    • Support Yarn on Qwen3 (#6785)
  • API

    • BREAKING CHANGE: Introduce sampler_type, detect sampler according to options (#6831)
    • Introduce sampler options in trtllm bench (#6855)
    • Support accurate device iter time (#6906)
    • Add batch wait timeout in fetching requests (#6923)
  • Benchmark

    • Add accuracy evaluation for AutoDeploy (#6764)
    • Add accuracy test for context and generation workers with different models (#6741)
    • Add DeepSeek-R1 FP8 accuracy tests on Blackwell (#6710)
    • Add NIM Related Cases [StarCoder2_7B] and [Codestral_22B_V01] (#6939)
    • Add NIM Related Cases Part 1 (#6684)
  • Feature

    • Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow (#6629)
    • Add single block version renormalized routing kernel (#6756)
    • Use Separate QKV Input Layout for Context MLA (#6538)
    • Enable accuracy test for MTP and chunked prefill (#6314)
  • Documentation

    • Update gpt-oss doc on MoE support matrix (#6908)
    • Modify the description for MLA chunked context (#6929)
    • Update wide-ep doc (#6933)
    • Update gpt oss doc (#6954)
    • Add more documents for large scale EP (#7029)
    • Add documentation for relaxed test threshold (#6997)
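The most notable item above is the breaking `sampler_type` change (#6831), which detects the sampler from the sampling options instead of requiring an explicit choice. The sketch below is a hypothetical illustration of that detection pattern only; the class names, option fields, and `pick_sampler` helper are invented for illustration and are not the actual TensorRT-LLM API.

```python
from dataclasses import dataclass
from enum import Enum


class SamplerType(Enum):
    AUTO = "auto"              # resolve from the other options
    GREEDY = "greedy"
    TOP_K_TOP_P = "top_k_top_p"
    BEAM_SEARCH = "beam_search"


@dataclass
class SamplingOptions:
    sampler_type: SamplerType = SamplerType.AUTO
    temperature: float = 1.0
    top_k: int = 0             # 0 means "not set"
    top_p: float = 1.0         # 1.0 means "not set"
    beam_width: int = 1


def pick_sampler(opts: SamplingOptions) -> SamplerType:
    """Resolve AUTO into a concrete sampler based on the other options."""
    if opts.sampler_type is not SamplerType.AUTO:
        return opts.sampler_type           # an explicit choice always wins
    if opts.beam_width > 1:
        return SamplerType.BEAM_SEARCH
    if opts.top_k > 0 or opts.top_p < 1.0 or opts.temperature != 1.0:
        return SamplerType.TOP_K_TOP_P
    return SamplerType.GREEDY
```

With this pattern, `pick_sampler(SamplingOptions(top_k=50))` resolves to the top-k/top-p sampler, while default options fall back to greedy decoding — the general idea behind "detect sampler according to options".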

What's Changed

  • [https://nvbugs/5455651][fix] Make ngram use XQA attention on Blackwell by @mikeiovine in #6873
  • [https://nvbugs/5441714][chore] remove skip on disagg n-gram test by @raayandhar in #6872
  • [None] [feat] Add Tencent HunYuanMoEV1 model support by @qianbiaoxiang in #5521
  • [None][chore] Add tests for non-existent and completed request cancellation by @achartier in #6840
  • [None][doc] Update gpt-oss doc on MoE support matrix by @hlu1 in #6908
  • [https://nvbugs/5394685][fix] using static scheduler 2CTA MLA as WAR for an accuracy issue by @PerkzZheng in #6896
  • [https://nvbugs/5437106][fix] Add L4 Scout benchmarking WAR option in deploy guide by @JunyiXu-nv in #6829
  • [None][fix] Fix the issue of responsibility boundary between the assert and tllmException files by @Fan-Yunfan in #6723
  • [None][fix] Correct reporting of torch_dtype for ModelConfig class. by @FrankD412 in #6800
  • [None][fix] Fix perfect router. by @bobboli in #6797
  • [https://nvbugs/5415862][fix] Update cublas as 12.9.1 and cuda memory alignment as 256 by @Wanli-Jiang in #6501
  • [None][fix] Update tests to use standardized uppercase backend identifiers by @bo-nv in #6921
  • [TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures by @chzblych in #6836
  • [None][doc] Modify the description for mla chunked context by @jmydurant in #6929
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #6914
  • [None][chore] add a EditorConfig config by @zhenhuaw-me in #6897
  • [https://nvbugs/5451373][fix] : Fix the accuracy issue when using FP8 context MLA by @peaceh-nv in #6881
  • [https://nvbugs/5405041][fix] Update wide-ep doc by @qiaoxj07 in #6933
  • [None] [chore] Mamba cache in separate file by @tomeras91 in #6796
  • [https://nvbugs/5427801][fix] Torch compile support for Llama4 and Ea… by @liji-nv in #6858
  • [https://nvbugs/5394685][fix] proper fix for the accuracy issue in 2CTA MLA kernels by @PerkzZheng in #6941
  • [https://nvbugs/5394392][fix] Enlarge scheduler capacity under disagg bs == 1 by @yifeizhang-c in #6537
  • [None][test] Add accuracy evaluation for AutoDeploy by @ajrasane in #6764
  • [None][fix] Make TP working for Triton MOE (in additional to EP we are using) by @dongfengy in #6722
  • [TRTLLM-5863][feat] Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow by @Yuening-wa in #6629
  • [https://nvbugs/5401114][fix] Unwaive Gemma3 tests by @brb-nv in #6952
  • [None][chore] Bump version to 1.1.0rc1 by @yiqingy0 in #6953
  • [TRTLLM-7157][feat] BREAKING CHANGE Introduce sampler_type, detect sampler according to options by @dcampora in #6831
  • [None][fix] Skip Topk if 0 by @IzzyPutterman in #6934
  • [None][fix] Fix: Using RAII to automatically manage the allocation and release of va_list for potential resource leak by @Fan-Yunfan in #6758
  • [None][feat] Support Yarn on Qwen3 by @byshiue in #6785
  • [None][feat] Add single block version renormalized routing kernel by @ChristinaZ in #6756
  • [None][infra] Waive failed cases in main branch by @EmmaQiaoCh in #6951
  • [https://nvbugs/5390853][fix] Fix _test_openai_lora.py - disable cuda graph by @amitz-nv in #6965
  • [https://nvbugs/5451028][fix] Constrain NemotronSuper test parameters to prevent OOMs by @Naveassaf in #6970
  • [None][infra] update feature_combination_matrix of disaggregated and Eagle3 by @leslie-fang25 in #6945
  • [None][doc] Update gpt oss doc by @bobboli in #6954
  • [None] [feat] Support accurate device iter time by @kaiyux in #6906
  • [TRTLLM-7030][fix] uppercase def value in pd-config by @Shixiaowei02 in #6981
  • [None] [fix] Fix the macro name by @ChristinaZ in #6983
  • [None][infra] Waive failed tests on main 0818 by @EmmaQiaoCh in #6992
  • [None][chore] Remove duplicate test waives by @yiqingy0 in #6998
  • [None][fix] Clean up linking to CUDA stub libraries in build_wheel.py by @MartinMarciniszyn in #6823
  • [None][infra] Cherry-pick #6836 from main branch and improve SSH connection (#6971) by @chzblych in #7005
  • [TRTLLM-7158][feat] Introduce sampler options in trtllm bench by @dcampora in #6855
  • [None][infra] Enable accuracy test for mtp and chunked prefill by @leslie-fang25 in #6314
  • [None][autodeploy] Doc: fix link path in trtllm bench doc by @Fridah-nv in #7007
  • [https://nvbugs/5371480][fix] Enable test_phi3_small_8k by @Wanli-Jiang in #6938
  • [TRTLLM-7014][chore] Add accuracy test for ctx and gen workers with different models by @reasonsolo in #6741
  • [None][refactor] Refactor Torch Compile Backend, MoeLoadBalancer and warmup Logic by @yizhang-nv in #6615
  • [None] [infra] stricter coderabbit pr title generation instructions by @venkywonka in #6918
  • [TRTLLM-6960][fix] enable scaled_mm tests by @dc3671 in #6936
  • [TRTLLM-6991][chore] add DeepSeek-R1 FP8 accuracy tests on Blackwell by @lfr-0531 in #6710
  • [TRTLLM-6541][test] Add NIM Related Cases [StarCoder2_7B] and [Codestral_22B_V01] by @fredricz-20070104 in #6939
  • [https://nvbugs/5454875][ci] Unwaive Mistral Small 3.1 test by @2ez4bz in #7011
  • [TRTLLM-6541][test] Add NIM Related Cases Part 1 by @crazydemo in #6684
  • [https://nvbugs/5458798][fix] Relaxed test threshold, added documentation by @MrGeva in #6997
  • [None][opt] Add batch wait timeout in fetching requests by @Shunkangz in #6923
  • [None][chore] Remove closed bugs by @xinhe-nv in #6969
  • [None][fix] acceptance rate calculation fix in benchmark_serving by @zerollzeng in #6746
  • [None] [doc] Add more documents for large scale EP by @kaiyux in #7029
  • [None] [chore] Update wide-ep genonly scripts by @qiaoxj07 in #6995
  • [TRTLLM-7263][fix] Prevent recreation of cublas handles in lora_grouped_gemm every call by @amitz-nv in #6968
  • [https://nvbugs/5458874][fix] Fix Nemotron-H flaky CUDA graph / overlap scheduler test by @tomeras91 in #6996
  • [https://nvbugs/5455140][fix] unwaive DSR1-fp4 throughput_tp8 by @lfr-0531 in #7022
  • [None][chore] Remove duplicate test waives by @yiqingy0 in #7044
  • [None][infra] Waive failed tests on main 08/19 by @EmmaQiaoCh in #7037
  • [None][feat] Use Separate QKV Input Layout for Context MLA by @zhhuang-nv in #6538
  • [https://nvbugs/5444937][chore] Fixing KV events tests by @pcastonguay in #7004
  • [https://nvbugs/5451296][bug] Cherry-pick #7017 from release/1.0 branch by @chzblych in #7043
  • [None][fix] Accommodate Phi3/4 to work with ModelOpt's FP8 ckpts in Torch by @moraxu in #6761
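Among the changes above, #6923 adds a batch wait timeout when fetching requests: the server waits briefly for more requests to arrive so batches run fuller, trading a bounded amount of latency for throughput. A self-contained sketch of that general pattern follows; the `fetch_batch` helper and its signature are illustrative, not TensorRT-LLM internals.

```python
import queue
import time


def fetch_batch(q: queue.Queue, max_batch_size: int, batch_wait_timeout: float) -> list:
    """Drain up to max_batch_size requests, waiting at most
    batch_wait_timeout seconds for more once the first request is in hand."""
    batch = [q.get()]  # block until at least one request exists
    deadline = time.monotonic() + batch_wait_timeout
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

Setting the timeout to zero degenerates to "take whatever is already queued", while a small positive value lets stragglers join the batch — the knob the PR exposes.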

New Contributors

Full Changelog: v1.1.0rc0...v1.1.0rc1