Turboquant native fp8 v4 store by sladyn98 · Pull Request #45748 · vllm-project/vllm

sladyn98 · 2026-06-16T00:01:43Z

Related to #40069.

Add a native CUDA store path for TurboQuant turboquant_k8v4 (FP8 keys, 4-bit
values) KV cache storage on Hopper+ GPUs.
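
To make the "4-bit value" half of the format concrete, here is a minimal sketch of packing 4-bit codes two per byte. The function names, the low-nibble-first order, and the zero padding for odd lengths are illustrative assumptions, not the layout used by the kernel in this PR.

```python
def pack_nibbles(codes: list[int]) -> bytes:
    """Pack 4-bit codes (0..15) two per byte, low nibble first.

    Assumption: nibble order and zero padding are illustrative choices only.
    """
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must fit in 4 bits")
    # Pad odd lengths (for example D=127) with one zero code.
    padded = list(codes) + [0] * (len(codes) % 2)
    return bytes(lo | (hi << 4) for lo, hi in zip(padded[0::2], padded[1::2]))


def unpack_nibbles(packed: bytes, count: int) -> list[int]:
    """Inverse of pack_nibbles; `count` drops the padding nibble if any."""
    out: list[int] = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out[:count]
```

Under these assumptions, a head dimension of 127 and one of 128 both occupy 64 packed bytes, which is one reason an odd dimension is worth covering in tests.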

The existing Triton store path remains the fallback. The native path is only
selected when the request matches the supported contract:

- FP8 key path
- 4-bit value quantization
- CUDA device using Hopper+ E4M3 FP8
- contiguous int32 slot_mapping
- native op is built and available
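
The contract above can be read as a single predicate that must hold before the native path is chosen. The sketch below is hypothetical: `StoreRequest`, its fields, and `can_use_native_store` are names invented for illustration, and the compute-capability check assumes "Hopper+" means SM 9.0 or newer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreRequest:
    """Hypothetical summary of the properties the dispatch would inspect."""

    key_is_fp8: bool
    value_bits: int
    is_cuda: bool
    compute_capability: tuple
    slot_mapping_dtype: str
    slot_mapping_contiguous: bool
    native_op_available: bool


def can_use_native_store(req: StoreRequest) -> bool:
    """Return True only when every condition of the contract holds."""
    return (
        req.key_is_fp8
        and req.value_bits == 4
        and req.is_cuda
        and req.compute_capability >= (9, 0)  # assumption: Hopper is SM 9.0
        and req.slot_mapping_dtype == "int32"
        and req.slot_mapping_contiguous
        and req.native_op_available
    )
```

Any request that fails one of these checks would fall through to the Triton store, so the predicate only ever narrows where the native kernel runs.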

This PR also adds a kill switch, the VLLM_DISABLE_TURBOQUANT_NATIVE_STORE
environment variable, and validation guards in the C++ wrapper.
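
A rough sketch of how such a kill switch could be read follows. Only the variable name comes from this PR; the helper `native_store_disabled` and the set of truthy strings are assumptions, and the actual parsing in vllm/envs.py may differ.

```python
import os


def native_store_disabled(environ=os.environ) -> bool:
    """Return True when the kill switch is set.

    Assumption: "1", "true", "yes" and "on" (case-insensitive) count as set.
    """
    raw = environ.get("VLLM_DISABLE_TURBOQUANT_NATIVE_STORE", "0")
    return raw.strip().lower() in {"1", "true", "yes", "on"}
```

With a helper like this, the dispatch would check the switch first and fall back to the Triton store whenever it is set.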

Test Plan

- Add byte-parity coverage comparing the native CUDA store against the existing
  Triton store.
- Cover fp16 and bf16 inputs.
- Cover D=127, D=128, and D=256.
- Cover FP8 saturation with large key values.
- Cover negative slot mappings.
- Add an H100 Buildkite lane for the native store parity test so it does not
  silently skip on non-Hopper CI workers.
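
The shape of a byte-parity check with negative slots can be sketched on the CPU without either kernel. Everything here is illustrative: `reference_store` and `first_mismatch` are hypothetical helpers, and treating a negative slot as "skip this token" is an assumption about the store contract rather than something stated in this PR.

```python
from typing import Optional


def reference_store(cache: bytearray, rows: list, slot_mapping: list) -> None:
    """Write each encoded row into its slot; negative slots are skipped.

    Assumption: a negative slot marks a padded token that must not be written.
    """
    for row, slot in zip(rows, slot_mapping):
        if slot < 0:
            continue
        start = slot * len(row)
        cache[start : start + len(row)] = row


def first_mismatch(a: bytes, b: bytes) -> Optional[int]:
    """Return the index of the first differing byte, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None
```

In a real test, the two buffers would come from the native and Triton stores run on the same inputs, and the assertion would be that `first_mismatch` returns `None`.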

Test Result

Local static checks passed:

git diff --check main...HEAD
python3 -m py_compile \
  vllm/v1/attention/ops/triton_turboquant_store.py \
  vllm/_custom_ops.py \
  vllm/envs.py \
  tests/quantization/test_turboquant.py
uvx ruff check \
  vllm/v1/attention/ops/triton_turboquant_store.py \
  vllm/_custom_ops.py \
  vllm/envs.py \
  tests/quantization/test_turboquant.py
uvx ruff format --check \
  vllm/v1/attention/ops/triton_turboquant_store.py \
  vllm/_custom_ops.py \
  vllm/envs.py \
  tests/quantization/test_turboquant.py

I could not run the CUDA parity test locally because this machine does not have a
CUDA/Hopper environment. The PR adds a dedicated H100 CI lane to run:

pytest -v -s quantization/test_turboquant.py::TestStoreDecodeRoundTrip::test_native_fp8_v4_store_matches_triton

---
<details>
<summary> Essential Elements of an Effective PR Description Checklist </summary>

- [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR
  will resolve)".
- [x] The test plan, such as providing test command.
- [x] The test results, such as pasting the results comparison before and after, or
  e2e results.
- [x] Documentation update considered. Not needed because this is an internal
  kernel dispatch optimization with Triton fallback and no user-facing API change.
</details>

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing

Signed-off-by: Sladyn Nunes <sladyn@meta.com>

sladyn98 added 2 commits June 12, 2026 18:21

Add native CUDA TurboQuant FP8 V4 store

bb13cb3

Signed-off-by: Sladyn Nunes <sladyn@meta.com>

Harden native TurboQuant store validation

7fa104a

Signed-off-by: Sladyn Nunes <sladyn@meta.com>

mergify Bot added ci/build v1 labels Jun 16, 2026

Merge branch 'main' into turboquant-native-fp8-v4-store

4963eee

sladyn98 marked this pull request as ready for review June 16, 2026 00:56

sladyn98 requested review from AndreasKaratzas, Harry-Chen, LucasWilkinson, MatthewBonanni, khluu, mgoin, pavanimajety, robertgshaw2-redhat, tlrmchlsmth, yewentao256 and zyongye as code owners June 16, 2026 00:56

Harden TurboQuant native store test coverage

61279d5

Signed-off-by: Sladyn Nunes <sladyn@meta.com>

sladyn98 wants to merge 4 commits into vllm-project:main from sladyn98:turboquant-native-fp8-v4-store