v1.0.0rc6
Pre-release
Announcement Highlights:
- Model Support
- Feature
  - Add LoRA support for Gemma3 (#6371)
  - Add support of scheduling attention dp request (#6246)
  - Multi-block mode for Hopper spec dec XQA kernel (#4416)
  - LLM sleep & wakeup Part 1: virtual device memory (#5034)
  - best_of/n for pytorch workflow (#5997)
  - Add speculative metrics for trtllm-bench (#6476)
  - (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec (#6379)
  - Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 (#6522)
  - Check input tokens + improve error handling (#5170)
  - Add support for fused gate_up_proj scales for FP8 blockwise (#6496)
  - Add vLLM KV Pool support for XQA kernel (#6013)
  - Switch to internal version of MMProjector in Gemma3 (#6572)
  - Enable fp8 SwiGLU to minimize host overhead (#6540)
  - Add Qwen3 MoE support to TensorRT backend (#6470)
  - UCX establish connection with ZMQ (#6090)
  - Disable add special tokens for Llama3.3 70B (#6482)
- API
- Benchmark
- Documentation
  - Make example SLURM scripts more parameterized (#6511)
  - blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) (#6547)
  - Expose the latest tech blogs in README.md (#6553)
  - Update known issues (#6247)
  - trtllm-serve doc improvement (#5220)
  - Add GPT-OSS Deployment Guide documentation (#6637)
  - Expose the GPT OSS model support blog (#6647)
  - Add Llama4 hybrid guide (#6640)
  - Add DeepSeek R1 deployment guide (#6579)
  - Create deployment guide for Llama4 Scout FP8 and NVFP4 (#6550)
Known Issues
- On bare-metal Ubuntu 22.04 or 24.04, please install the `cuda-python==12.9.1` package after installing the TensorRT-LLM wheel. This resolves an incompatibility with the default cuda-python 13, which fails with `ImportError: cannot import name 'cuda' from 'cuda'`.
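The workaround above can be applied as follows (a minimal sketch; the exact wheel install command for your environment may differ):

```shell
# Install the TensorRT-LLM wheel first, then pin cuda-python to 12.9.1
# to avoid the cuda-python 13 ImportError described above.
pip install tensorrt_llm
pip install cuda-python==12.9.1

# Sanity check: this import fails under cuda-python 13
python -c "from cuda import cuda; print('cuda-python import OK')"
```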
What's Changed
- [fix] Fix missing fields in xqa kernel cache key by @lowsfer in #6282
- [TRTLLM-6364][infra] Validate for PR titles to ensure they follow the required format by @niukuo in #6278
- [fix] Update get_trtllm_bench_build_command to handle batch size and tokens by @venkywonka in #6313
- refactor: Remove unused buffers and bindings from sampler by @Funatiq in #6484
- chore: Make example SLURM scripts more parameterized by @kaiyux in #6511
- fix: Fix missing key by @zerollzeng in #6471
- [https://nvbugs/5419066][fix] Use trt flow LLM by @crazydemo in #6467
- [TRTLLM-4279] fix: Add a protection test for checking trtllm custom ops by @yali-arch in #6515
- [https://nvbugs/5419069][fix] Fix the mismatched layer name components. by @hyukn in #6417
- [None][doc] blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) by @kaiyux in #6547
- [None][chore] Disable add special tokens for Llama3.3 70B by @chenfeiz0326 in #6482
- [None][doc] Exposing the latest tech blogs in README.md by @juney-nvidia in #6553
- [None][fix] update nemotron nas tests free_gpu_memory_fraction=0.8 by @xinhe-nv in #6552
- [None][infra] Pin the version for triton to 3.3.1 (#6508) (#6519) by @chzblych in #6549
- [https://nvbugs/5340941][https://nvbugs/5375785] - fix: Wrap attentio… by @liji-nv in #6355
- [TRTLLM-6657][feat] Add LoRA support for Gemma3 by @brb-nv in #6371
- [https://nvbugs/5381276][fix] fix warning for fused_a_gemm by @yunruis in #6402
- [None][Infra] - Skip failed tests in post-merge by @EmmaQiaoCh in #6558
- [AutoDeploy] merge feat/ad-2025-07-22 by @lucaslie in #6520
- [TRTLLM-6624][feat] skip post blackwell by @xinhe-nv in #6357
- [TRTLLM-6357][test] Add accuracy tests for Qwen3 by @reasonsolo in #6177
- [None][fix] Serialize the window_size in the kv event by @richardhuo-nv in #6526
- [None][feat] Add support of scheduling attention dp request by @Shunkangz in #6246
- [None][refactor] Simplify finish reasons handling in DecoderState by @Funatiq in #6524
- [None][infra] add eagle3 one model accuracy tests by @jhaotingc in #6264
- [TRTLLM-6224][infra] Upgrade dependencies to DLFW 25.06 and CUDA 12.9.1 by @yiqingy0 in #5678
- Use cudaSetDevice to create context, fix nvbug 5394497 by @chuangz0 in #6403
- [None][feat] Multi-block mode for Hopper spec dec XQA kernel by @jhaotingc in #4416
- [TRTLLM-6473][test] add speculative decoding and ep load balance cases into QA test list by @crazydemo in #6436
- [fix] Fix DeepSeek w4a8 weight loading by @jinyangyuan-nvidia in #6498
- chore: add EXAONE4 accuracy test by @yechank-nvidia in #6397
- test: modify max_lora_rank of phi4_multimodal to 320 by @ruodil in #6474
- [None][chore] Mass integration of release/0.21 (part5) by @dc3671 in #6544
- [None][infra] update namelist by @niukuo in #6465
- [https://nvbugs/5430932][infra] update namelist by @niukuo in #6585
- [None][chore] add online help to build_wheel.py and fix a doc link by @zhenhuaw-me in #6391
- test: move ministral_8b_fp8 to fp8_specific gpu list(exclude Ampere) by @ruodil in #6533
- [TRTLLM-5563][infra] Move test_rerun.py to script folder by @yiqingy0 in #6571
- [None][infra] Enable accuracy test for eagle3 and chunked prefill by @leslie-fang25 in #6386
- [None][infra] Enable test of chunked prefill with logit post processor by @leslie-fang25 in #6483
- [TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory by @tongyuantongyu in #5034
- [None][fix] remove closed bugs by @xinhe-nv in #6576
- [None][fix] xqa precision for fp16/bf16 kv cache by @Bruce-Lee-LY in #6573
- [None][fix] Revert commit 48ddc3d & add test for disagg server with different max_num_tokens by @LinPoly in #6259
- [None][chore] Bump version to 1.0.0rc6 by @yiqingy0 in #6597
- [None][chore] Add unit test for Gemma3 lora by @brb-nv in #6560
- [TRTLLM-6364] [fix] Update PR title regex to allow optional spaces between ticket and type by @niukuo in #6598
- [None][infra] Waive failed case in post-merge on main by @EmmaQiaoCh in #6602
- [None][test] update invalid test name by @crazydemo in #6596
- [TRTLLM-5271][feat] best_of/n for pytorch workflow by @evezhier in #5997
- [None][chore] Update Gemma3 closeness check to mitigate flakiness by @brb-nv in #6591
- [TRTLLM-6685][feat] Add speculative metrics for trt llm bench by @kris1025 in #6476
- [None][doc] Fix blog4 typo by @syuoni in #6612
- [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6581
- [TRTLLM-6856][feat] add disaggregated serving tests to QA list by @xinhe-nv in #6536
- [https://nvbugs/5433581][infra] Update install docs and CI script for SBSA deep_gemm workaround by @chzblych in #6607
- [TRTLLM-5990][doc] trtllm-serve doc improvement. by @nv-guomingz in #5220
- [None][chore] Add readme for perf test by @ruodil in #6443
- [https://nvbugs/5436461][infra] Skip test_eagle3 test with device memory check by @leslie-fang25 in #6617
- [None][chore] ucx establish connection with zmq by @chuangz0 in #6090
- [TRTLLM-6674][feat] (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec by @symphonylyh in #6379
- [None][fix] Remove expand configuration from mamba2 mixer by @danielafrimi in #6521
- [TRTLLM-6826][feat] Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 by @amitz-nv in #6522
- [TRTLLM-6893][infra] fix Build Docker Image tag issue by @ZhanruiSunCh in #6555
- [https://nvbugs/5410279][test] resubmit timeout refactor by @crazydemo in #6337
- [None][doc] add introduction doc on qa test by @crazydemo in #6535
- [None][fix] fix kimi k2 serving and add test for Kimi-K2 by @pengbowang-nv in #6589
- [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6600
- [None][infra] Split GB200 stages for each test by @EmmaQiaoCh in #6594
- [https://nvbugs/5433581][infra] Temporarily disable Docker Image use wheel from build stage by @ZhanruiSunCh in #6630
- [TRTLLM-6761][refactor] Replace LogitBiasLogitsProcessor with embedding bias tensor system by @venkywonka in #6464
- [https://nvbugs/5252313][fix] Fix torch compile + MTP by @liji-nv in #6554
- [None][doc] Adding GPT-OSS Deployment Guide documentation by @farshadghodsian in #6637
- [TRTLLM-5508][feat] check input tokens + improve error handling by @ixlmar in #5170
- [https://nvbugs/5355007][fix] Set `enable_chunked_context` as True by default in trtllm bench by @Wanli-Jiang in #6582
- [None][feat] Add support for fused gate_up_proj scales for FP8 blockwise by @achartier in #6496
- [TRTLLM-5500][infra] Update CODEOWNERS with new ownership rules for additional paths by @venkywonka in #6564
- [None][feat] Refactor Llava-Next by @yechank-nvidia in #6478
- [None][feat] Add vLLM KV Pool support for XQA kernel by @Ransiki in #6013
- [None][opt] ADP schedule balance optimization by @yunruis in #6061
- [None][feat] Switch to internal version of MMProjector in Gemma3 by @brb-nv in #6572
- [TRTLLM-6263][feat] Enable fp8 SwiGLU to minimize host overhead by @JunyiXu-nv in #6540
- [None][doc] Exposing the GPT OSS model support blog by @juney-nvidia in #6647
- [None][doc] Add llama4 hybrid guide by @jiahanc in #6640
- [TRTLLM-6764][test] add new feature cases in cluster(B200/GB200) and sanity test by @ruodil in #6650
- [None][doc] Unify the tech blogs naming. by @nv-guomingz in #6649
- [None][fix] Fix 6522 mpi.pkl5.intracomm.Request has wait not Wait by @netanel-haber in #6646
- Update allreduce benchmark for torch by @Tabrizian in #6271
- [None][test] align kv_frac in perf test with perflab and add more cases for 4 gpus GB200 by @ruodil in #6632
- [https://nvbugs/5433581][fix] DeepGEMM installation on SBSA by @zongfeijing in #6588
- [None][feat] Add Qwen3 MoE support to TensorRT backend by @gkswns0531 in #6470
- [TRTLLM-5633][infra] Change the TOT repo to default-llm-repo for merge waive list by @yiqingy0 in #6605
- [https://nvbugs/5433581][fix] Revert deep_gemm installation workaround for SBSA by @chzblych in #6666
- [None][chore] Enhance trtllm-serve example test by @LinPoly in #6604
- [https://nvbugs/5328160][fix] Unwaive disaggregated serving tests by @Tabrizian in #6644
- [https://nvbugs/5430124][fix] Mistral mixture_text_image test case fix by @yechank-nvidia in #6648
- [None][chore] add missing tests to test list by @Superjomn in #6590
- [TRTLLM-6859][doc] Add DeepSeek R1 deployment guide. by @yuxianq in #6579
- [None][doc] Create deployment guide for Llama4 Scout FP8 and NVFP4 by @chenfeiz0326 in #6550
New Contributors
- @yali-arch made their first contribution in #6515
- @richardhuo-nv made their first contribution in #6526
- @Bruce-Lee-LY made their first contribution in #6573
- @kris1025 made their first contribution in #6476
- @symphonylyh made their first contribution in #6379
- @farshadghodsian made their first contribution in #6637
- @Ransiki made their first contribution in #6013
- @JunyiXu-nv made their first contribution in #6540
Full Changelog: v1.0.0rc5...v1.0.0rc6