
v1.0.0rc6

Pre-release
@Superjomn released this 07 Aug 10:54 · 348 commits to main since this release · a16ba64

Announcement Highlights:

  • Feature

    • Add LoRA support for Gemma3 (#6371)
    • Add support for scheduling attention DP requests (#6246)
    • Multi-block mode for Hopper spec dec XQA kernel (#4416)
    • LLM sleep & wakeup Part 1: virtual device memory (#5034)
    • Support best_of/n in the PyTorch workflow (#5997)
    • Add speculative decoding metrics to trtllm-bench (#6476)
    • (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec (#6379)
    • Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 (#6522)
    • Check input tokens and improve error handling (#5170)
    • Add support for fused gate_up_proj scales for FP8 blockwise (#6496)
    • Add vLLM KV Pool support for XQA kernel (#6013)
    • Switch to internal version of MMProjector in Gemma3 (#6572)
    • Enable fp8 SwiGLU to minimize host overhead (#6540)
    • Add Qwen3 MoE support to TensorRT backend (#6470)
    • Establish UCX connection with ZMQ (#6090)
    • Disable add special tokens for Llama3.3 70B (#6482)
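Among the features above, the >2 GiB MPI change (#6522) relies on pickle protocol 5 (PEP 574) out-of-band buffers, which `mpi4py.util.pkl5` uses to send the small metadata stream and each raw buffer as separate messages rather than one oversized pickled blob. A stdlib-only sketch of that serialization mechanism (no MPI required; the 16 MiB payload is illustrative):

```python
import pickle

# pickle protocol 5 can hand large contiguous buffers to a callback
# instead of copying them into the pickle stream. A buffer_callback
# that returns a false value (list.append returns None) marks the
# buffer as out-of-band.
payload = bytearray(1 << 24)  # 16 MiB stand-in for a large tensor

buffers = []
stream = pickle.dumps(pickle.PickleBuffer(payload), protocol=5,
                      buffer_callback=buffers.append)

# The pickle stream holds only metadata; the bulk lives in `buffers`.
assert len(stream) < 1024
assert sum(b.raw().nbytes for b in buffers) == len(payload)

# Receiving side: reassemble from the stream plus the raw buffers.
restored = pickle.loads(stream, buffers=buffers)
assert bytes(restored) == bytes(payload)
```

`mpi4py.util.pkl5` wraps communicators so that each out-of-band buffer travels as its own MPI message, sidestepping the 2 GiB count limit of a single send.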
  • Benchmark

    • ADP schedule balance optimization (#6061)
    • Update allreduce benchmark for torch (#6271)
  • Documentation

    • Make example SLURM scripts more parameterized (#6511)
    • blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) (#6547)
    • Exposing the latest tech blogs in README.md (#6553)
    • Update known issues (#6247)
    • Improve trtllm-serve documentation (#5220)
    • Adding GPT-OSS Deployment Guide documentation (#6637)
    • Exposing the GPT OSS model support blog (#6647)
    • Add llama4 hybrid guide (#6640)
    • Add DeepSeek R1 deployment guide. (#6579)
    • Create deployment guide for Llama4 Scout FP8 and NVFP4 (#6550)
  • Known Issues

    • On bare-metal Ubuntu 22.04 or 24.04, install the cuda-python==12.9.1 package after installing the TensorRT-LLM wheel. This resolves an incompatibility with the default cuda-python 13, which otherwise fails with ImportError: cannot import name 'cuda' from 'cuda'.
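The workaround above amounts to two pip steps; a minimal sketch, assuming a pip-based install (depending on your setup, NVIDIA's extra package index may also be required):

```shell
# Install the TensorRT-LLM wheel first...
pip install tensorrt_llm
# ...then pin cuda-python to avoid the cuda-python 13 ImportError noted above.
pip install cuda-python==12.9.1
```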

What's Changed

  • [fix] Fix missing fields in xqa kernel cache key by @lowsfer in #6282
  • [TRTLLM-6364][infra] Validate for PR titles to ensure they follow the required format by @niukuo in #6278
  • [fix] Update get_trtllm_bench_build_command to handle batch size and tokens by @venkywonka in #6313
  • refactor: Remove unused buffers and bindings from sampler by @Funatiq in #6484
  • chore: Make example SLURM scripts more parameterized by @kaiyux in #6511
  • fix: Fix missing key by @zerollzeng in #6471
  • [https://nvbugs/5419066][fix] Use trt flow LLM by @crazydemo in #6467
  • [TRTLLM-4279] fix: Add a protection test for checking trtllm custom ops by @yali-arch in #6515
  • [https://nvbugs/5419069][fix] Fix the mismatched layer name components. by @hyukn in #6417
  • [None][doc] blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) by @kaiyux in #6547
  • [None][chore] Disable add special tokens for Llama3.3 70B by @chenfeiz0326 in #6482
  • [None][doc] Exposing the latest tech blogs in README.md by @juney-nvidia in #6553
  • [None][fix] update nemotron nas tests free_gpu_memory_fraction=0.8 by @xinhe-nv in #6552
  • [None][infra] Pin the version for triton to 3.3.1 (#6508) (#6519) by @chzblych in #6549
  • [https://nvbugs/5340941][https://nvbugs/5375785] - fix: Wrap attentio… by @liji-nv in #6355
  • [TRTLLM-6657][feat] Add LoRA support for Gemma3 by @brb-nv in #6371
  • [https://nvbugs/5381276][fix] fix warning for fused_a_gemm by @yunruis in #6402
  • [None][Infra] - Skip failed tests in post-merge by @EmmaQiaoCh in #6558
  • [AutoDeploy] merge feat/ad-2025-07-22 by @lucaslie in #6520
  • [TRTLLM-6624][feat] skip post blackwell by @xinhe-nv in #6357
  • [TRTLLM-6357][test] Add accuracy tests for Qwen3 by @reasonsolo in #6177
  • [None][fix] Serialize the window_size in the kv event by @richardhuo-nv in #6526
  • [None][feat] Add support of scheduling attention dp request by @Shunkangz in #6246
  • [None][refactor] Simplify finish reasons handling in DecoderState by @Funatiq in #6524
  • [None][infra] add eagle3 one model accuracy tests by @jhaotingc in #6264
  • [TRTLLM-6224][infra] Upgrade dependencies to DLFW 25.06 and CUDA 12.9.1 by @yiqingy0 in #5678
  • Use cudaSetDevice to create context, fix nvbug 5394497 by @chuangz0 in #6403
  • [None][feat] Multi-block mode for Hopper spec dec XQA kernel by @jhaotingc in #4416
  • [TRTLLM-6473][test] add speculative decoding and ep load balance cases into QA test list by @crazydemo in #6436
  • [fix] Fix DeepSeek w4a8 weight loading by @jinyangyuan-nvidia in #6498
  • chore: add EXAONE4 accuracy test by @yechank-nvidia in #6397
  • test: modify max_lora_rank of phi4_multimodal to 320 by @ruodil in #6474
  • [None][chore] Mass integration of release/0.21 (part5) by @dc3671 in #6544
  • [None][infra] update namelist by @niukuo in #6465
  • [https://nvbugs/5430932][infra] update namelist by @niukuo in #6585
  • [None][chore] add online help to build_wheel.py and fix a doc link by @zhenhuaw-me in #6391
  • test: move ministral_8b_fp8 to fp8_specific gpu list(exclude Ampere) by @ruodil in #6533
  • [TRTLLM-5563][infra] Move test_rerun.py to script folder by @yiqingy0 in #6571
  • [None][infra] Enable accuracy test for eagle3 and chunked prefill by @leslie-fang25 in #6386
  • [None][infra] Enable test of chunked prefill with logit post processor by @leslie-fang25 in #6483
  • [TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory by @tongyuantongyu in #5034
  • [None][fix] remove closed bugs by @xinhe-nv in #6576
  • [None][fix] xqa precision for fp16/bf16 kv cache by @Bruce-Lee-LY in #6573
  • [None][fix] Revert commit 48ddc3d & add test for disagg server with different max_num_tokens by @LinPoly in #6259
  • [None][chore] Bump version to 1.0.0rc6 by @yiqingy0 in #6597
  • [None][chore] Add unit test for Gemma3 lora by @brb-nv in #6560
  • [TRTLLM-6364] [fix] Update PR title regex to allow optional spaces between ticket and type by @niukuo in #6598
  • [None][infra] Waive failed case in post-merge on main by @EmmaQiaoCh in #6602
  • [None][test] update invalid test name by @crazydemo in #6596
  • [TRTLLM-5271][feat] best_of/n for pytorch workflow by @evezhier in #5997
  • [None][chore] Update Gemma3 closeness check to mitigate flakiness by @brb-nv in #6591
  • [TRTLLM-6685][feat] Add speculative metrics for trt llm bench by @kris1025 in #6476
  • [None][doc] Fix blog4 typo by @syuoni in #6612
  • [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6581
  • [TRTLLM-6856][feat] add disaggregated serving tests to QA list by @xinhe-nv in #6536
  • [https://nvbugs/5433581][infra] Update install docs and CI script for SBSA deep_gemm workaround by @chzblych in #6607
  • [TRTLLM-5990][doc] trtllm-serve doc improvement. by @nv-guomingz in #5220
  • [None][chore] Add readme for perf test by @ruodil in #6443
  • [https://nvbugs/5436461][infra] Skip test_eagle3 test with device memory check by @leslie-fang25 in #6617
  • [None][chore] ucx establish connection with zmq by @chuangz0 in #6090
  • [TRTLLM-6674][feat] (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec by @symphonylyh in #6379
  • [None][fix] Remove expand configuration from mamba2 mixer by @danielafrimi in #6521
  • [TRTLLM-6826][feat] Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 by @amitz-nv in #6522
  • [TRTLLM-6893][infra] fix Build Docker Image tag issue by @ZhanruiSunCh in #6555
  • [https://nvbugs/5410279][test] resubmit timeout refactor by @crazydemo in #6337
  • [None][doc] add introduction doc on qa test by @crazydemo in #6535
  • [None][fix] fix kimi k2 serving and add test for Kimi-K2 by @pengbowang-nv in #6589
  • [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6600
  • [None][infra] Split GB200 stages for each test by @EmmaQiaoCh in #6594
  • [https://nvbugs/5433581][infra] Temporarily disable Docker Image use wheel from build stage by @ZhanruiSunCh in #6630
  • [TRTLLM-6761][refactor] Replace LogitBiasLogitsProcessor with embedding bias tensor system by @venkywonka in #6464
  • [https://nvbugs/5252313][fix] Fix torch compile + MTP by @liji-nv in #6554
  • [None][doc] Adding GPT-OSS Deployment Guide documentation by @farshadghodsian in #6637
  • [TRTLLM-5508][feat] check input tokens + improve error handling by @ixlmar in #5170
  • [https://nvbugs/5355007][fix] Set enable_chunked_context as True by default in trtllm bench by @Wanli-Jiang in #6582
  • [None][feat] Add support for fused gate_up_proj scales for FP8 blockwise by @achartier in #6496
  • [TRTLLM-5500][infra] Update CODEOWNERS with new ownership rules for additional paths by @venkywonka in #6564
  • [None][feat] Refactor Llava-Next by @yechank-nvidia in #6478
  • [None][feat] Add vLLM KV Pool support for XQA kernel by @Ransiki in #6013
  • [None][opt] ADP schedule balance optimization by @yunruis in #6061
  • [None][feat] Switch to internal version of MMProjector in Gemma3 by @brb-nv in #6572
  • [TRTLLM-6263][feat] Enable fp8 SwiGLU to minimize host overhead by @JunyiXu-nv in #6540
  • [None][doc] Exposing the GPT OSS model support blog by @juney-nvidia in #6647
  • [None][doc] Add llama4 hybrid guide by @jiahanc in #6640
  • [TRTLLM-6764][test] add new feature cases in cluster(B200/GB200) and sanity test by @ruodil in #6650
  • [None][doc] Unify the tech blogs naming. by @nv-guomingz in #6649
  • [None][fix] Fix 6522 mpi.pkl5.intracomm.Request has wait not Wait by @netanel-haber in #6646
  • Update allreduce benchmark for torch by @Tabrizian in #6271
  • [None][test] align kv_frac in perf test with perflab and add more cases for 4 gpus GB200 by @ruodil in #6632
  • [https://nvbugs/5433581][fix] DeepGEMM installation on SBSA by @zongfeijing in #6588
  • [None][feat] Add Qwen3 MoE support to TensorRT backend by @gkswns0531 in #6470
  • [TRTLLM-5633][infra] Change the TOT repo to default-llm-repo for merge waive list by @yiqingy0 in #6605
  • [https://nvbugs/5433581][fix] Revert deep_gemm installation workaround for SBSA by @chzblych in #6666
  • [None][chore] Enhance trtllm-serve example test by @LinPoly in #6604
  • [https://nvbugs/5328160][fix] Unwaive disaggregated serving tests by @Tabrizian in #6644
  • [https://nvbugs/5430124][fix] Mistral mixture_text_image test case fix by @yechank-nvidia in #6648
  • [None][chore] add missing tests to test list by @Superjomn in #6590
  • [TRTLLM-6859][doc] Add DeepSeek R1 deployment guide. by @yuxianq in #6579
  • [None][doc] Create deployment guide for Llama4 Scout FP8 and NVFP4 by @chenfeiz0326 in #6550

Full Changelog: v1.0.0rc5...v1.0.0rc6