
v1.0.0rc6

Pre-release
@Superjomn released this 07 Aug 10:54 · 348 commits to main since this release · a16ba64

Announcement Highlights:

  • Feature

    • Add LoRA support for Gemma3 (#6371)
    • Add support for scheduling attention DP requests (#6246)
    • Multi-block mode for Hopper spec dec XQA kernel (#4416)
    • LLM sleep & wakeup Part 1: virtual device memory (#5034)
    • Support best_of/n in the PyTorch workflow (#5997)
    • Add speculative decoding metrics to trtllm-bench (#6476)
    • (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec (#6379)
    • Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 (#6522)
    • Check input tokens and improve error handling (#5170)
    • Add support for fused gate_up_proj scales for FP8 blockwise (#6496)
    • Add vLLM KV Pool support for XQA kernel (#6013)
    • Switch to internal version of MMProjector in Gemma3 (#6572)
    • Enable fp8 SwiGLU to minimize host overhead (#6540)
    • Add Qwen3 MoE support to TensorRT backend (#6470)
    • Establish UCX connection with ZMQ (#6090)
    • Disable add special tokens for Llama3.3 70B (#6482)
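Among the features above, the >2 GiB MPI change (#6522) relies on pickle protocol 5 (PEP 574) out-of-band buffers, which `mpi4py.util.pkl5` uses to send the small metadata stream and each raw buffer as separate messages rather than one oversized pickled blob. A stdlib-only sketch of that serialization mechanism (no MPI required; the 16 MiB payload is illustrative):

```python
import pickle

# pickle protocol 5 can hand large contiguous buffers to a callback
# instead of copying them into the pickle stream. A buffer_callback
# that returns a false value (list.append returns None) marks the
# buffer as out-of-band.
payload = bytearray(1 << 24)  # 16 MiB stand-in for a large tensor

buffers = []
stream = pickle.dumps(pickle.PickleBuffer(payload), protocol=5,
                      buffer_callback=buffers.append)

# The pickle stream holds only metadata; the bulk lives in `buffers`.
assert len(stream) < 1024
assert sum(b.raw().nbytes for b in buffers) == len(payload)

# Receiving side: reassemble from the stream plus the raw buffers.
restored = pickle.loads(stream, buffers=buffers)
assert bytes(restored) == bytes(payload)
```

`mpi4py.util.pkl5` wraps communicators so that each out-of-band buffer travels as its own MPI message, sidestepping the 2 GiB count limit of a single send.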
  • Benchmark

    • ADP schedule balance optimization (#6061)
    • Update allreduce benchmark for torch (#6271)
  • Documentation

    • Make example SLURM scripts more parameterized (#6511)
    • blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) (#6547)
    • Exposing the latest tech blogs in README.md (#6553)
    • Update known issues (#6247)
    • Improve trtllm-serve documentation (#5220)
    • Adding GPT-OSS Deployment Guide documentation (#6637)
    • Exposing the GPT OSS model support blog (#6647)
    • Add llama4 hybrid guide (#6640)
    • Add DeepSeek R1 deployment guide. (#6579)
    • Create deployment guide for Llama4 Scout FP8 and NVFP4 (#6550)
  • Known Issues

    • On bare-metal Ubuntu 22.04 or 24.04, install the cuda-python==12.9.1 package after installing the TensorRT-LLM wheel. This resolves an incompatibility with the default cuda-python 13, which otherwise fails with ImportError: cannot import name 'cuda' from 'cuda'.
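The workaround above amounts to two pip steps; a minimal sketch, assuming a pip-based install (depending on your setup, NVIDIA's extra package index may also be required):

```shell
# Install the TensorRT-LLM wheel first...
pip install tensorrt_llm
# ...then pin cuda-python to avoid the cuda-python 13 ImportError noted above.
pip install cuda-python==12.9.1
```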

What's Changed

  • [fix] Fix missing fields in xqa kernel cache key by @lowsfer in #6282
  • [TRTLLM-6364][infra] Validate for PR titles to ensure they follow the required format by @niukuo in #6278
  • [fix] Update get_trtllm_bench_build_command to handle batch size and tokens by @venkywonka in #6313
  • refactor: Remove unused buffers and bindings from sampler by @Funatiq in #6484
  • chore: Make example SLURM scripts more parameterized by @kaiyux in #6511
  • fix: Fix missing key by @zerollzeng in #6471
  • [https://nvbugs/5419066][fix] Use trt flow LLM by @crazydemo in #6467
  • [TRTLLM-4279] fix: Add a protection test for checking trtllm custom ops by @yali-arch in #6515
  • [https://nvbugs/5419069][fix] Fix the mismatched layer name components. by @hyukn in #6417
  • [None][doc] blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) by @kaiyux in #6547
  • [None][chore] Disable add special tokens for Llama3.3 70B by @chenfeiz0326 in #6482
  • [None][doc] Exposing the latest tech blogs in README.md by @juney-nvidia in #6553
  • [None][fix] update nemotron nas tests free_gpu_memory_fraction=0.8 by @xinhe-nv in #6552
  • [None][infra] Pin the version for triton to 3.3.1 (#6508) (#6519) by @chzblych in #6549
  • [https://nvbugs/5340941][https://nvbugs/5375785] - fix: Wrap attentio… by @liji-nv in #6355
  • [TRTLLM-6657][feat] Add LoRA support for Gemma3 by @brb-nv in #6371
  • [https://nvbugs/5381276][fix] fix warning for fused_a_gemm by @yunruis in #6402
  • [None][Infra] - Skip failed tests in post-merge by @EmmaQiaoCh in #6558
  • [AutoDeploy] merge feat/ad-2025-07-22 by @lucaslie in #6520
  • [TRTLLM-6624][feat] skip post blackwell by @xinhe-nv in #6357
  • [TRTLLM-6357][test] Add accuracy tests for Qwen3 by @reasonsolo in #6177
  • [None][fix] Serialize the window_size in the kv event by @richardhuo-nv in #6526
  • [None][feat] Add support of scheduling attention dp request by @Shunkangz in #6246
  • [None][refactor] Simplify finish reasons handling in DecoderState by @Funatiq in #6524
  • [None][infra] add eagle3 one model accuracy tests by @jhaotingc in #6264
  • [TRTLLM-6224][infra] Upgrade dependencies to DLFW 25.06 and CUDA 12.9.1 by @yiqingy0 in #5678
  • Use cudaSetDevice to create context, fix nvbug 5394497 by @chuangz0 in #6403
  • [None][feat] Multi-block mode for Hopper spec dec XQA kernel by @jhaotingc in #4416
  • [TRTLLM-6473][test] add speculative decoding and ep load balance cases into QA test list by @crazydemo in #6436
  • [fix] Fix DeepSeek w4a8 weight loading by @jinyangyuan-nvidia in #6498
  • chore: add EXAONE4 accuracy test by @yechank-nvidia in #6397
  • test: modify max_lora_rank of phi4_multimodal to 320 by @ruodil in #6474
  • [None][chore] Mass integration of release/0.21 (part5) by @dc3671 in #6544
  • [None][infra] update namelist by @niukuo in #6465
  • [https://nvbugs/5430932][infra] update namelist by @niukuo in #6585
  • [None][chore] add online help to build_wheel.py and fix a doc link by @zhenhuaw-me in #6391
  • test: move ministral_8b_fp8 to fp8_specific gpu list(exclude Ampere) by @ruodil in #6533
  • [TRTLLM-5563][infra] Move test_rerun.py to script folder by @yiqingy0 in #6571
  • [None][infra] Enable accuracy test for eagle3 and chunked prefill by @leslie-fang25 in #6386
  • [None][infra] Enable test of chunked prefill with logit post processor by @leslie-fang25 in #6483
  • [TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory by @tongyuantongyu in #5034
  • [None][fix] remove closed bugs by @xinhe-nv in #6576
  • [None][fix] xqa precision for fp16/bf16 kv cache by @Bruce-Lee-LY in #6573
  • [None][fix] Revert commit 48ddc3d & add test for disagg server with different max_num_tokens by @LinPoly in #6259
  • [None][chore] Bump version to 1.0.0rc6 by @yiqingy0 in #6597
  • [None][chore] Add unit test for Gemma3 lora by @brb-nv in #6560
  • [TRTLLM-6364] [fix] Update PR title regex to allow optional spaces between ticket and type by @niukuo in #6598
  • [None][infra] Waive failed case in post-merge on main by @EmmaQiaoCh in #6602
  • [None][test] update invalid test name by @crazydemo in #6596
  • [TRTLLM-5271][feat] best_of/n for pytorch workflow by @evezhier in #5997
  • [None][chore] Update Gemma3 closeness check to mitigate flakiness by @brb-nv in #6591
  • [TRTLLM-6685][feat] Add speculative metrics for trt llm bench by @kris1025 in #6476
  • [None][doc] Fix blog4 typo by @syuoni in #6612
  • [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6581
  • [TRTLLM-6856][feat] add disaggregated serving tests to QA list by @xinhe-nv in #6536
  • [https://nvbugs/5433581][infra] Update install docs and CI script for SBSA deep_gemm workaround by @chzblych in #6607
  • [TRTLLM-5990][doc] trtllm-serve doc improvement. by @nv-guomingz in #5220
  • [None][chore] Add readme for perf test by @ruodil in #6443
  • [https://nvbugs/5436461][infra] Skip test_eagle3 test with device memory check by @leslie-fang25 in #6617
  • [None][chore] ucx establish connection with zmq by @chuangz0 in #6090
  • [TRTLLM-6674][feat] (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec by @symphonylyh in #6379
  • [None][fix] Remove expand configuration from mamba2 mixer by @danielafrimi in #6521
  • [TRTLLM-6826][feat] Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 by @amitz-nv in #6522
  • [TRTLLM-6893][infra] fix Build Docker Image tag issue by @ZhanruiSunCh in #6555
  • [https://nvbugs/5410279][test] resubmit timeout refactor by @crazydemo in #6337
  • [None][doc] add introduction doc on qa test by @crazydemo in #6535
  • [None][fix] fix kimi k2 serving and add test for Kimi-K2 by @pengbowang-nv in #6589
  • [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6600
  • [None][infra] Split GB200 stages for each test by @EmmaQiaoCh in #6594
  • [https://nvbugs/5433581][infra] Temporarily disable Docker Image use wheel from build stage by @ZhanruiSunCh in #6630
  • [TRTLLM-6761][refactor] Replace LogitBiasLogitsProcessor with embedding bias tensor system by @venkywonka in #6464
  • [https://nvbugs/5252313][fix] Fix torch compile + MTP by @liji-nv in #6554
  • [None][doc] Adding GPT-OSS Deployment Guide documentation by @farshadghodsian in #6637
  • [TRTLLM-5508][feat] check input tokens + improve error handling by @ixlmar in #5170
  • [https://nvbugs/5355007][fix] Set enable_chunked_context as True by default in trtllm bench by @Wanli-Jiang in #6582
  • [None][feat] Add support for fused gate_up_proj scales for FP8 blockwise by @achartier in #6496
  • [TRTLLM-5500][infra] Update CODEOWNERS with new ownership rules for additional paths by @venkywonka in #6564
  • [None][feat] Refactor Llava-Next by @yechank-nvidia in #6478
  • [None][feat] Add vLLM KV Pool support for XQA kernel by @Ransiki in #6013
  • [None][opt] ADP schedule balance optimization by @yunruis in #6061
  • [None][feat] Switch to internal version of MMProjector in Gemma3 by @brb-nv in #6572
  • [TRTLLM-6263][feat] Enable fp8 SwiGLU to minimize host overhead by @JunyiXu-nv in #6540
  • [None][doc] Exposing the GPT OSS model support blog by @juney-nvidia in #6647
  • [None][doc] Add llama4 hybrid guide by @jiahanc in #6640
  • [TRTLLM-6764][test] add new feature cases in cluster(B200/GB200) and sanity test by @ruodil in #6650
  • [None][doc] Unify the tech blogs naming. by @nv-guomingz in #6649
  • [None][fix] Fix 6522 mpi.pkl5.intracomm.Request has wait not Wait by @netanel-haber in #6646
  • Update allreduce benchmark for torch by @Tabrizian in #6271
  • [None][test] align kv_frac in perf test with perflab and add more cases for 4 gpus GB200 by @ruodil in #6632
  • [https://nvbugs/5433581][fix] DeepGEMM installation on SBSA by @zongfeijing in #6588
  • [None][feat] Add Qwen3 MoE support to TensorRT backend by @gkswns0531 in #6470
  • [TRTLLM-5633][infra] Change the TOT repo to default-llm-repo for merge waive list by @yiqingy0 in #6605
  • [https://nvbugs/5433581][fix] Revert deep_gemm installation workaround for SBSA by @chzblych in #6666
  • [None][chore] Enhance trtllm-serve example test by @LinPoly in #6604
  • [https://nvbugs/5328160][fix] Unwaive disaggregated serving tests by @Tabrizian in #6644
  • [https://nvbugs/5430124][fix] Mistral mixture_text_image test case fix by @yechank-nvidia in #6648
  • [None][chore] add missing tests to test list by @Superjomn in #6590
  • [TRTLLM-6859][doc] Add DeepSeek R1 deployment guide. by @yuxianq in #6579
  • [None][doc] Create deployment guide for Llama4 Scout FP8 and NVFP4 by @chenfeiz0326 in #6550

Full Changelog: v1.0.0rc5...v1.0.0rc6