
v1.1.0rc1

Pre-release
@Superjomn Superjomn released this 22 Aug 10:02
· 152 commits to main since this release
7334f93

Announcement Highlights:

  • Model Support

    • Add Tencent HunYuanMoEV1 model support (#5521)
    • Support Yarn on Qwen3 (#6785)
  • API

    • BREAKING CHANGE: Introduce sampler_type, detect sampler according to options (#6831)
    • Introduce sampler options in trtllm bench (#6855)
    • Support accurate device iter time (#6906)
    • Add batch wait timeout in fetching requests (#6923)
  • Benchmark

    • Add accuracy evaluation for AutoDeploy (#6764)
    • Add accuracy test for context and generation workers with different models (#6741)
    • Add DeepSeek-R1 FP8 accuracy tests on Blackwell (#6710)
    • Add NIM Related Cases [StarCoder2_7B] and [Codestral_22B_V01] (#6939)
    • Add NIM Related Cases Part 1 (#6684)
  • Feature

    • Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow (#6629)
    • Add single block version renormalized routing kernel (#6756)
    • Use Separate QKV Input Layout for Context MLA (#6538)
    • Enable accuracy test for MTP and chunked prefill (#6314)
  • Documentation

    • Update gpt-oss doc on MoE support matrix (#6908)
    • Modify the description for MLA chunked context (#6929)
    • Update wide-ep doc (#6933)
    • Update gpt oss doc (#6954)
    • Add more documents for large scale EP (#7029)
    • Add documentation for relaxed test threshold (#6997)
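The most notable item above is the breaking `sampler_type` change (#6831), which detects the sampler from the sampling options instead of requiring an explicit choice. The sketch below is a hypothetical illustration of that detection pattern only; the class names, option fields, and `pick_sampler` helper are invented for illustration and are not the actual TensorRT-LLM API.

```python
from dataclasses import dataclass
from enum import Enum


class SamplerType(Enum):
    AUTO = "auto"              # resolve from the other options
    GREEDY = "greedy"
    TOP_K_TOP_P = "top_k_top_p"
    BEAM_SEARCH = "beam_search"


@dataclass
class SamplingOptions:
    sampler_type: SamplerType = SamplerType.AUTO
    temperature: float = 1.0
    top_k: int = 0             # 0 means "not set"
    top_p: float = 1.0         # 1.0 means "not set"
    beam_width: int = 1


def pick_sampler(opts: SamplingOptions) -> SamplerType:
    """Resolve AUTO into a concrete sampler based on the other options."""
    if opts.sampler_type is not SamplerType.AUTO:
        return opts.sampler_type           # an explicit choice always wins
    if opts.beam_width > 1:
        return SamplerType.BEAM_SEARCH
    if opts.top_k > 0 or opts.top_p < 1.0 or opts.temperature != 1.0:
        return SamplerType.TOP_K_TOP_P
    return SamplerType.GREEDY
```

With this pattern, `pick_sampler(SamplingOptions(top_k=50))` resolves to the top-k/top-p sampler, while default options fall back to greedy decoding — the general idea behind "detect sampler according to options".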

What's Changed

  • [https://nvbugs/5455651][fix] Make ngram use XQA attention on Blackwell by @mikeiovine in #6873
  • [https://nvbugs/5441714][chore] remove skip on disagg n-gram test by @raayandhar in #6872
  • [None] [feat] Add Tencent HunYuanMoEV1 model support by @qianbiaoxiang in #5521
  • [None][chore] Add tests for non-existent and completed request cancellation by @achartier in #6840
  • [None][doc] Update gpt-oss doc on MoE support matrix by @hlu1 in #6908
  • [https://nvbugs/5394685][fix] using static scheduler 2CTA MLA as WAR for an accuracy issue by @PerkzZheng in #6896
  • [https://nvbugs/5437106][fix] Add L4 Scout benchmarking WAR option in deploy guide by @JunyiXu-nv in #6829
  • [None][fix] Fix the issue of responsibility boundary between the assert and tllmException files by @Fan-Yunfan in #6723
  • [None][fix] Correct reporting of torch_dtype for ModelConfig class. by @FrankD412 in #6800
  • [None][fix] Fix perfect router. by @bobboli in #6797
  • [https://nvbugs/5415862][fix] Update cublas as 12.9.1 and cuda memory alignment as 256 by @Wanli-Jiang in #6501
  • [None][fix] Update tests to use standardized uppercase backend identifiers by @bo-nv in #6921
  • [TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures by @chzblych in #6836
  • [None][doc] Modify the description for mla chunked context by @jmydurant in #6929
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #6914
  • [None][chore] add a EditorConfig config by @zhenhuaw-me in #6897
  • [https://nvbugs/5451373][fix] : Fix the accuracy issue when using FP8 context MLA by @peaceh-nv in #6881
  • [https://nvbugs/5405041][fix] Update wide-ep doc by @qiaoxj07 in #6933
  • [None] [chore] Mamba cache in separate file by @tomeras91 in #6796
  • [https://nvbugs/5427801][fix] Torch compile support for Llama4 and Ea… by @liji-nv in #6858
  • [https://nvbugs/5394685][fix] proper fix for the accuracy issue in 2CTA MLA kernels by @PerkzZheng in #6941
  • [https://nvbugs/5394392][fix] Enlarge scheduler capacity under disagg bs == 1 by @yifeizhang-c in #6537
  • [None][test] Add accuracy evaluation for AutoDeploy by @ajrasane in #6764
  • [None][fix] Make TP working for Triton MOE (in additional to EP we are using) by @dongfengy in #6722
  • [TRTLLM-5863][feat] Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow by @Yuening-wa in #6629
  • [https://nvbugs/5401114][fix] Unwaive Gemma3 tests by @brb-nv in #6952
  • [None][chore] Bump version to 1.1.0rc1 by @yiqingy0 in #6953
  • [TRTLLM-7157][feat] BREAKING CHANGE Introduce sampler_type, detect sampler according to options by @dcampora in #6831
  • [None][fix] Skip Topk if 0 by @IzzyPutterman in #6934
  • [None][fix] Fix: Using RAII to automatically manage the allocation and release of va_list for potential resource leak by @Fan-Yunfan in #6758
  • [None][feat] Support Yarn on Qwen3 by @byshiue in #6785
  • [None][feat] Add single block version renormalized routing kernel by @ChristinaZ in #6756
  • [None][infra] Waive failed cases in main branch by @EmmaQiaoCh in #6951
  • [https://nvbugs/5390853][fix] Fix _test_openai_lora.py - disable cuda graph by @amitz-nv in #6965
  • [https://nvbugs/5451028][fix] Constrain NemotronSuper test parameters to prevent OOMs by @Naveassaf in #6970
  • [None][infra] update feature_combination_matrix of disaggregated and Eagle3 by @leslie-fang25 in #6945
  • [None][doc] Update gpt oss doc by @bobboli in #6954
  • [None] [feat] Support accurate device iter time by @kaiyux in #6906
  • [TRTLLM-7030][fix] uppercase def value in pd-config by @Shixiaowei02 in #6981
  • [None] [fix] Fix the macro name by @ChristinaZ in #6983
  • [None][infra] Waive failed tests on main 0818 by @EmmaQiaoCh in #6992
  • [None][chore] Remove duplicate test waives by @yiqingy0 in #6998
  • [None][fix] Clean up linking to CUDA stub libraries in build_wheel.py by @MartinMarciniszyn in #6823
  • [None][infra] Cherry-pick #6836 from main branch and improve SSH connection (#6971) by @chzblych in #7005
  • [TRTLLM-7158][feat] Introduce sampler options in trtllm bench by @dcampora in #6855
  • [None][infra] Enable accuracy test for mtp and chunked prefill by @leslie-fang25 in #6314
  • [None][autodeploy] Doc: fix link path in trtllm bench doc by @Fridah-nv in #7007
  • [https://nvbugs/5371480][fix] Enable test_phi3_small_8k by @Wanli-Jiang in #6938
  • [TRTLLM-7014][chore] Add accuracy test for ctx and gen workers with different models by @reasonsolo in #6741
  • [None][refactor] Refactor Torch Compile Backend, MoeLoadBalancer and warmup Logic by @yizhang-nv in #6615
  • [None] [infra] stricter coderabbit pr title generation instructions by @venkywonka in #6918
  • [TRTLLM-6960][fix] enable scaled_mm tests by @dc3671 in #6936
  • [TRTLLM-6991][chore] add DeepSeek-R1 FP8 accuracy tests on Blackwell by @lfr-0531 in #6710
  • [TRTLLM-6541][test] Add NIM Related Cases [StarCoder2_7B] and [Codestral_22B_V01] by @fredricz-20070104 in #6939
  • [https://nvbugs/5454875][ci] Unwaive Mistral Small 3.1 test by @2ez4bz in #7011
  • [TRTLLM-6541][test] Add NIM Related Cases Part 1 by @crazydemo in #6684
  • [https://nvbugs/5458798][fix] Relaxed test threshold, added documentation by @MrGeva in #6997
  • [None][opt] Add batch wait timeout in fetching requests by @Shunkangz in #6923
  • [None][chore] Remove closed bugs by @xinhe-nv in #6969
  • [None][fix] acceptance rate calculation fix in benchmark_serving by @zerollzeng in #6746
  • [None] [doc] Add more documents for large scale EP by @kaiyux in #7029
  • [None] [chore] Update wide-ep genonly scripts by @qiaoxj07 in #6995
  • [TRTLLM-7263][fix] Prevent recreation of cublas handles in lora_grouped_gemm every call by @amitz-nv in #6968
  • [https://nvbugs/5458874][fix] Fix Nemotron-H flaky CUDA graph / overlap scheduler test by @tomeras91 in #6996
  • [https://nvbugs/5455140][fix] unwaive DSR1-fp4 throughput_tp8 by @lfr-0531 in #7022
  • [None][chore] Remove duplicate test waives by @yiqingy0 in #7044
  • [None][infra] Waive failed tests on main 08/19 by @EmmaQiaoCh in #7037
  • [None][feat] Use Separate QKV Input Layout for Context MLA by @zhhuang-nv in #6538
  • [https://nvbugs/5444937][chore] Fixing KV events tests by @pcastonguay in #7004
  • [https://nvbugs/5451296][bug] Cherry-pick #7017 from release/1.0 branch by @chzblych in #7043
  • [None][fix] Accommodate Phi3/4 to work with ModelOpt's FP8 ckpts in Torch by @moraxu in #6761
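Among the changes above, #6923 adds a batch wait timeout when fetching requests: the server waits briefly for more requests to arrive so batches run fuller, trading a bounded amount of latency for throughput. A self-contained sketch of that general pattern follows; the `fetch_batch` helper and its signature are illustrative, not TensorRT-LLM internals.

```python
import queue
import time


def fetch_batch(q: queue.Queue, max_batch_size: int, batch_wait_timeout: float) -> list:
    """Drain up to max_batch_size requests, waiting at most
    batch_wait_timeout seconds for more once the first request is in hand."""
    batch = [q.get()]  # block until at least one request exists
    deadline = time.monotonic() + batch_wait_timeout
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

Setting the timeout to zero degenerates to "take whatever is already queued", while a small positive value lets stragglers join the batch — the knob the PR exposes.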

New Contributors

Full Changelog: v1.1.0rc0...v1.1.0rc1