v0.3.0
What's Changed
- Backend: downgrade trtllm-gen kernel to cuda-12 by @cyx-6 in #1567
- feat: Add fp8-qkv, fp16/bf16 output MHA by @weireweire in #1540
- bump cutlass submodule to v4.2 by @ttyio in #1572
- typo: fix typo in variable names of fp4 masked gemm by @fzyzcjy in #1570
- benchmark: Add autotuner to MoE benchmark by @nv-yunzheq in #1536
- bugfix: fix cuda version guard macros by @nvjullin in #1571
- misc: remove some unused files by @yzh119 in #1574
- bugfix: update trtllm-gen gemm kernel names by @cyx-6 in #1577
- feat: Support inferring out_dtype from out.dtype for the TRTLLM attention kernel (sketch below) by @elvischenv in #1578
- fix: semaphores must be at a fixed range in the workspace buffer for trtllm_gen attention by @yyihuang in #1584
- bugfix: Fix arg passing to TORCH_CHECK and TORCH_WARN macros by @amitz-nv in #1582
- refactor: Expose calculate_tile_tokens_dim function by @amitz-nv in #1581
- fix: an unignorable narrowing-conversion issue by @luccafong in #1586
- bugfix: Fix test_fp4_quantize test bug by @sricketts in #1585
- update trtllm-gen fp4 autotuner and routing by @IwakuraRein in #1573
- fix: limit the number of nvcc threads for each kernel by @yzh119 in #1589
- fix: Improve TRTLLM attention kernel out_dtype unit test by @elvischenv in #1590
- refactor: use allocator class for workspace buffer allocation by @yyihuang in #1588
- misc: Fix footnote and typo in CONTRIBUTING.md by @sricketts in #1583
- Mnnvl memory with custom communicator by @wenscarl in #1245
- Add mnnvl_moe_alltoallv_prepare_without_allgather by @trevor-m in #1550
- bugfix: Add version checks to tests/test_hopper*.py files by @bkryu in #1594
- Remove cuda-python from the dependencies and check for it at runtime (sketch below) by @VALLIS-NERIA in #1534
- bugfix: fix fused-temperature softmax IMA (illegal memory access) issue by @yzh119 in #1596
- bugfix: Fix RuntimeError("FlashInfer requires sm75+") by @hijkzzz in #1598
- bugfix: fix the register overflow issue for top-k renorm kernels on Blackwell by @yzh119 in #1597
- bugfix: fix unittest test_fp8_quantize by @yzh119 in #1599
- bugfix: fix multi-GPU/node unit tests: skip instead of failing when there aren't enough GPUs (sketch below) by @bkryu in #1600
- feat: Enable MnnvlMemory (for alltoallv) on B200 by @trevor-m in #1601
- ci: add CI container for CUDA 13 and add cute-dsl as a dependency by @yzh119 in #1595
- ci: Fix unit tests of the logits processor by @yzh119 in #1602
- feat: integrate xqa attention backend by @qsang-nv in #1503
- [cute dsl] optimize cute dsl make_ptr perf by @limin2021 in #1607
- bugfix: fix fp4 quantization with 8x4 scale factor layout by @cyx-6 in #1611
- feat: enable trtllm-gen attention speculative-decoding verification via decode by @yyihuang in #1453
- ci: limit aot parallel build jobs based on available memory by @yongwww in #1612
- release: bump version to v0.3.0 by @yzh119 in #1617
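
#1578 lets the TRTLLM attention kernel pick up its output dtype from a preallocated `out` tensor instead of requiring an explicit `out_dtype` argument. A minimal sketch of that inference pattern, with a hypothetical `run_attention` wrapper standing in for the real kernel entry point:

```python
from typing import Optional

import torch

def run_attention(q: torch.Tensor,
                  out: Optional[torch.Tensor] = None,
                  out_dtype: Optional[torch.dtype] = None) -> torch.Tensor:
    # Hypothetical wrapper, not the actual flashinfer API: when a
    # preallocated `out` buffer is supplied, its dtype takes precedence,
    # so callers no longer have to pass out_dtype explicitly.
    if out is not None:
        if out_dtype is not None and out_dtype != out.dtype:
            raise ValueError("out_dtype conflicts with out.dtype")
        out_dtype = out.dtype        # infer from the provided buffer
    elif out_dtype is None:
        out_dtype = q.dtype          # fall back to the input dtype
    if out is None:
        out = torch.empty_like(q, dtype=out_dtype)
    # ... launch the attention kernel, writing into `out` ...
    return out
```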
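#1534 moves the cuda-python requirement from install time to run time. A sketch of the general pattern, assuming cuda-python's importable package name is `cuda`; the helper name is illustrative, not the library's actual API:

```python
import importlib.util

def require_cuda_python() -> None:
    # Illustrative runtime check: instead of declaring cuda-python as a
    # hard install-time dependency, fail with a clear message only when
    # (and if) the package is actually needed.
    if importlib.util.find_spec("cuda") is None:
        raise ImportError(
            "cuda-python is required at runtime; install it with "
            "`pip install cuda-python`"
        )
```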
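#1600 makes multi-GPU tests skip rather than fail on hosts without enough devices. A minimal sketch using the standard pytest.mark.skipif pattern (the test name is illustrative):

```python
import pytest
import torch

# Skip, rather than fail, when the host has fewer GPUs than the test needs.
requires_two_gpus = pytest.mark.skipif(
    torch.cuda.device_count() < 2,
    reason="requires at least 2 GPUs",
)

@requires_two_gpus
def test_two_gpu_allreduce():
    ...  # body runs only on hosts with >= 2 GPUs
```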
New Contributors
- @amitz-nv made their first contribution in #1582
- @luccafong made their first contribution in #1586
- @trevor-m made their first contribution in #1550
- @VALLIS-NERIA made their first contribution in #1534
- @hijkzzz made their first contribution in #1598
- @qsang-nv made their first contribution in #1503
- @limin2021 made their first contribution in #1607
Full Changelog: v0.2.14.post1...v0.3.0