v0.3.0
What's Changed
- Backend: downgrade trtllm-gen kernel to cuda-12 by @cyx-6 in #1567
- feat: Add fp8-qkv, fp16/bf16 output MHA by @weireweire in #1540
- bump cutlass submodule to v4.2 by @ttyio in #1572
- typo: fix typo in variable names of fp4 masked gemm by @fzyzcjy in #1570
- benchmark: Add autotuner to MoE benchmark by @nv-yunzheq in #1536
- bugfix: fix cuda version guard macros by @nvjullin in #1571
- misc: remove some unused files by @yzh119 in #1574
- bugfix: update trtllm-gen gemm kernel names by @cyx-6 in #1577
- feat: Support inferring out_dtype from out.dtype for the TRTLLM attention kernel (sketch below) by @elvischenv in #1578
- fix: semaphores must be at a fixed range in the workspace buffer for trtllm_gen attention by @yyihuang in #1584
- bugfix: Fix arg passing to TORCH_CHECK and TORCH_WARN macros by @amitz-nv in #1582
- refactor: Expose calculate_tile_tokens_dim function by @amitz-nv in #1581
- fix: an unignorable narrowing-conversion issue by @luccafong in #1586
- bugfix: Fix test_fp4_quantize test bug by @sricketts in #1585
- update trtllm-gen fp4 autotuner and routing by @IwakuraRein in #1573
- fix: limit the number of nvcc threads for each kernel by @yzh119 in #1589
- fix: Improve TRTLLM attention kernel out_dtype unit test by @elvischenv in #1590
- refactor: use allocator class for workspace buffer allocation by @yyihuang in #1588
- misc: Fix footnote and typo in CONTRIBUTING.md by @sricketts in #1583
- Mnnvl memory with custom communicator by @wenscarl in #1245
- Add mnnvl_moe_alltoallv_prepare_without_allgather by @trevor-m in #1550
- bugfix: Add version checks to tests/test_hopper*.py files by @bkryu in #1594
- Remove cuda-python from the dependencies and check for it at runtime (sketch below) by @VALLIS-NERIA in #1534
- bugfix: fix fused-temperature softmax IMA (illegal memory access) issue by @yzh119 in #1596
- bugfix: Fix RuntimeError("FlashInfer requires sm75+") by @hijkzzz in #1598
- bugfix: fix the register overflow issue for top-k renorm kernels on Blackwell by @yzh119 in #1597
- bugfix: fix unittest test_fp8_quantize by @yzh119 in #1599
- bugfix: fix multi-GPU/node unit tests: skip instead of failing when there aren't enough GPUs (sketch below) by @bkryu in #1600
- feat: Enable MnnvlMemory (for alltoallv) on B200 by @trevor-m in #1601
- ci: add CI container for CUDA 13 and add cute-dsl as a dependency by @yzh119 in #1595
- ci: Fix unit tests of the logits processor by @yzh119 in #1602
- feat: integrate xqa attention backend by @qsang-nv in #1503
- [cute dsl] optimize cute dsl make_ptr perf by @limin2021 in #1607
- bugfix: fix fp4 quantization with 8x4 scale factor layout by @cyx-6 in #1611
- feat: enable trtllm-gen attention speculative-decoding verification via decode by @yyihuang in #1453
- ci: limit aot parallel build jobs based on available memory by @yongwww in #1612
- release: bump version to v0.3.0 by @yzh119 in #1617
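
#1578 lets the TRTLLM attention kernel pick up its output dtype from a preallocated `out` tensor instead of requiring an explicit `out_dtype` argument. A minimal sketch of that inference pattern, with a hypothetical `run_attention` wrapper standing in for the real kernel entry point:

```python
from typing import Optional

import torch

def run_attention(q: torch.Tensor,
                  out: Optional[torch.Tensor] = None,
                  out_dtype: Optional[torch.dtype] = None) -> torch.Tensor:
    # Hypothetical wrapper, not the actual flashinfer API: when a
    # preallocated `out` buffer is supplied, its dtype takes precedence,
    # so callers no longer have to pass out_dtype explicitly.
    if out is not None:
        if out_dtype is not None and out_dtype != out.dtype:
            raise ValueError("out_dtype conflicts with out.dtype")
        out_dtype = out.dtype        # infer from the provided buffer
    elif out_dtype is None:
        out_dtype = q.dtype          # fall back to the input dtype
    if out is None:
        out = torch.empty_like(q, dtype=out_dtype)
    # ... launch the attention kernel, writing into `out` ...
    return out
```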
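#1534 moves the cuda-python requirement from install time to run time. A sketch of the general pattern, assuming cuda-python's importable package name is `cuda`; the helper name is illustrative, not the library's actual API:

```python
import importlib.util

def require_cuda_python() -> None:
    # Illustrative runtime check: instead of declaring cuda-python as a
    # hard install-time dependency, fail with a clear message only when
    # (and if) the package is actually needed.
    if importlib.util.find_spec("cuda") is None:
        raise ImportError(
            "cuda-python is required at runtime; install it with "
            "`pip install cuda-python`"
        )
```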
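#1600 makes multi-GPU tests skip rather than fail on hosts without enough devices. A minimal sketch using the standard pytest.mark.skipif pattern (the test name is illustrative):

```python
import pytest
import torch

# Skip, rather than fail, when the host has fewer GPUs than the test needs.
requires_two_gpus = pytest.mark.skipif(
    torch.cuda.device_count() < 2,
    reason="requires at least 2 GPUs",
)

@requires_two_gpus
def test_two_gpu_allreduce():
    ...  # body runs only on hosts with >= 2 GPUs
```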
New Contributors
- @amitz-nv made their first contribution in #1582
- @luccafong made their first contribution in #1586
- @trevor-m made their first contribution in #1550
- @VALLIS-NERIA made their first contribution in #1534
- @hijkzzz made their first contribution in #1598
- @qsang-nv made their first contribution in #1503
- @limin2021 made their first contribution in #1607
Full Changelog: v0.2.14.post1...v0.3.0