What's Changed
- hotfix: change MAX_JOBS in aot ci by @yzh119 in #1621
- fix: export MAX_JOBS for AOT build by @yongwww in #1626
- feat: initial support for SM103, SM110, SM120, SM121 by @aleozlx in #1608
- perf: Fix the tactic sorting in TrtllmGenBatchedGemmRunner::getValidConfigIndices by @jinyangyuan-nvidia in #1615
- Fix wrong arg name in cute dsl gemm API and silent error when passing wrong kwargs by @fzyzcjy in #1619
- bugfix: fix merge_attention_state in BatchAttention w/ gqa-group-size in Qwen family by @happierpig in #1614
- bugfix: fix multi-gpu/node unit-test: skip when there aren't enough GPUs in test_trtllm_mnnvl_allreduce by @bkryu in #1627
- ci: add cuda-13 unittests to CI by @yzh119 in #1603
- Revert "hotfix: change MAX_JOBS in aot ci (#1621)" by @yzh119 in #1629
- patch mm segfault & patch cubin availability by @aleozlx in #1628
- bugfix: fix flashinfer_benchmark.py IMA when running a test list by @bkryu in #1625
- feat: cutlass fp4 gemm bringup for SM120 & SM121 by @yongwww in #1609
- feat: update flashinfer-cli by @yzh119 in #1613
- bugfix: trtllm-gen fmha sm101 and sm100 compatibility by @cyx-6 in #1631
- bugfix: collect all modules to aot by @yzh119 in #1622
- fix: pass workspace for trtllm-gen attention by @yyihuang in #1635
- feat: cutlass fp8 gemm bringup for SM120 & SM121 by @yongwww in #1610
- test: pytest.mark.xfail on deepgemm by @yongwww in #1636
- release: bump version v0.3.1 by @yongwww in #1637
Full Changelog: v0.3.0...v0.3.1