v0.3.0

@yzh119 released this 01 Sep 06:21
· 35 commits to main since this release
f131f3d

What's Changed

  • Backend: downgrade trtllm-gen kernel to cuda-12 by @cyx-6 in #1567
  • feat: Add fp8-qkv, fp16/bf16 output MHA by @weireweire in #1540
  • bump cutlass submodule to v4.2 by @ttyio in #1572
  • typo: fix typo in variable names of fp4 masked gemm by @fzyzcjy in #1570
  • benchmark: Add autotuner to moe benchmark by @nv-yunzheq in #1536
  • bugfix: fix cuda version guard macros by @nvjullin in #1571
  • misc: remove some unused files by @yzh119 in #1574
  • bugfix: update trtllm-gen gemm kernel names by @cyx-6 in #1577
  • feat: Support for inferring out_dtype from out.dtype for TRTLLM attention kernel by @elvischenv in #1578
  • fix: semaphores must be at the fixed range in workspace buffer on trtllm_gen attention by @yyihuang in #1584
  • bugfix: Fix arg passing to TORCH_CHECK and TORCH_WARN macros by @amitz-nv in #1582
  • refactor: Expose calculate_tile_tokens_dim function by @amitz-nv in #1581
  • fix unignorable narrowing conversion issue by @luccafong in #1586
  • bugfix: Fix test_fp4_quantize test bug by @sricketts in #1585
  • update trtllm-gen fp4 autotuner and routing by @IwakuraRein in #1573
  • fix: limit the number of nvcc threads for each kernel by @yzh119 in #1589
  • fix: Improve TRTLLM attention kernel out_dtype unit test by @elvischenv in #1590
  • refactor: use allocator class for workspace buffer allocation by @yyihuang in #1588
  • misc: Fix footnote and typo in CONTRIBUTING.md by @sricketts in #1583
  • Mnnvl memory with custom communicator by @wenscarl in #1245
  • Add mnnvl_moe_alltoallv_prepare_without_allgather by @trevor-m in #1550
  • bugfix: Adding version checks to tests/test_hopper*.py files by @bkryu in #1594
  • Remove cuda-python from dependency and check at runtime by @VALLIS-NERIA in #1534
  • bugfix: fix fused-temperature softmax IMA issue by @yzh119 in #1596
  • bugfix: Fix RuntimeError("FlashInfer requires sm75+") by @hijkzzz in #1598
  • bugfix: fix the register overflow issue for topk renorm kernels on blackwell by @yzh119 in #1597
  • bugfix: fix unittest test_fp8_quantize by @yzh119 in #1599
  • bugfix: fix multi-gpu/node unit-test: skip when there aren't enough GPUs instead of failing by @bkryu in #1600
  • feat: Enable MnnvlMemory (for alltoallv) on B200 by @trevor-m in #1601
  • ci: add ci container of cuda 13 and add cute-dsl as dependency. by @yzh119 in #1595
  • ci: Fix unittests of logits processor by @yzh119 in #1602
  • feat: integrate xqa attention backend by @qsang-nv in #1503
  • [cute dsl] optimize cute dsl make_ptr perf by @limin2021 in #1607
  • bugfix: fix fp4 quantization with 8x4 scale factor layout by @cyx-6 in #1611
  • feat: enable trtllm-gen attn speculative decoding verify by decode by @yyihuang in #1453
  • ci: limit aot parallel build jobs based on available memory by @yongwww in #1612
  • release: bump version v0.3.0 by @yzh119 in #1617

New Contributors

Full Changelog: v0.2.14.post1...v0.3.0