Turboquant native fp8 v4 store #45748

Open
sladyn98 wants to merge 4 commits into vllm-project:main from sladyn98:turboquant-native-fp8-v4-store

Related to #40069.

Add a native CUDA store path for TurboQuant turboquant_k8v4 (FP8 key, 4-bit value)
KV cache storage on Hopper+ GPUs.

The existing Triton store path remains the fallback. The native path is only
selected when the request matches the supported contract:

  • FP8 key path
  • 4-bit value quantization
  • CUDA device using Hopper+ E4M3 FP8
  • contiguous int32 slot_mapping
  • native op is built and available
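
The selection contract above can be sketched as a single predicate. This is a minimal illustration, not the PR's dispatch code: `StoreRequest`, its field names, `can_use_native_store`, and `choose_store_path` are all hypothetical, and "Hopper+" is assumed to mean CUDA compute capability 9.0 or newer.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumption: "Hopper+" is read as compute capability >= 9.0 (H100 reports 9.0).
HOPPER_CAPABILITY = (9, 0)


@dataclass(frozen=True)
class StoreRequest:
    """Hypothetical summary of one store call; field names are illustrative."""

    key_is_fp8: bool
    value_bits: int
    device_type: str
    compute_capability: Tuple[int, int]
    slot_mapping_dtype: str
    slot_mapping_contiguous: bool
    native_op_available: bool


def can_use_native_store(req: StoreRequest) -> bool:
    """Return True only if every condition in the contract holds."""
    return (
        req.key_is_fp8
        and req.value_bits == 4
        and req.device_type == "cuda"
        and req.compute_capability >= HOPPER_CAPABILITY
        and req.slot_mapping_dtype == "int32"
        and req.slot_mapping_contiguous
        and req.native_op_available
    )


def choose_store_path(req: StoreRequest) -> str:
    """Fall back to the Triton path whenever any condition fails."""
    return "native" if can_use_native_store(req) else "triton"
```

Keeping the check as one conjunction means any single unmet condition routes the call to the Triton fallback, which matches the "only selected when the request matches" wording.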

This also adds a kill-switch through VLLM_DISABLE_TURBOQUANT_NATIVE_STORE and
validation guards in the C++ wrapper.
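
As a rough sketch of how such a kill-switch might be read: only the variable name VLLM_DISABLE_TURBOQUANT_NATIVE_STORE comes from this PR. The helper name `native_store_disabled` and the set of accepted truthy strings are assumptions and may differ from how `vllm/envs.py` actually parses it.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "VLLM_DISABLE_TURBOQUANT_NATIVE_STORE"

# Assumption: these spellings count as "disable"; the real parsing may differ.
_TRUTHY = {"1", "true", "yes", "on"}


def native_store_disabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the kill-switch variable is set to a truthy value."""
    env = os.environ if environ is None else environ
    return env.get(ENV_VAR, "0").strip().lower() in _TRUTHY
```

Passing the environment as an argument keeps the helper easy to test without mutating `os.environ`.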

Test Plan

  • Add byte-parity coverage comparing the native CUDA store against the existing
    Triton store.
  • Cover fp16 and bf16 inputs.
  • Cover D=127, D=128, and D=256.
  • Cover FP8 saturation with large key values.
  • Cover negative slot mappings.
  • Add an H100 Buildkite lane for the native store parity test so it does not
    silently skip on non-Hopper CI workers.
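
The shape of a byte-parity check can be sketched in pure Python with a toy cache. Everything here is illustrative: the nibble layout (two 4-bit codes per byte, low nibble first), the two stand-in store functions, and the convention that a negative slot means "skip this token" are assumptions, not TurboQuant's actual cache format or the PR's test code.

```python
from typing import Callable, Sequence

Store = Callable[[bytearray, Sequence[Sequence[int]], Sequence[int], int], None]


def pack_nibbles(codes: Sequence[int]) -> bytes:
    """Pack 4-bit codes two per byte, low nibble first; an odd tail is zero-padded."""
    out = bytearray((len(codes) + 1) // 2)
    for i, code in enumerate(codes):
        out[i // 2] |= (code & 0xF) << (4 * (i % 2))
    return bytes(out)


def store_reference(cache, rows, slot_mapping, slot_bytes):
    """Stand-in for the fallback path: pack each row, then copy it into its slot."""
    for codes, slot in zip(rows, slot_mapping):
        if slot < 0:  # assumed convention: negative slot means skip
            continue
        packed = pack_nibbles(codes)
        start = slot * slot_bytes
        cache[start:start + len(packed)] = packed


def store_candidate(cache, rows, slot_mapping, slot_bytes):
    """Stand-in for the new path: clears the slot, then ORs nibbles in place."""
    for codes, slot in zip(rows, slot_mapping):
        if slot < 0:
            continue
        base = slot * slot_bytes
        for i in range((len(codes) + 1) // 2):
            cache[base + i] = 0
        for i, code in enumerate(codes):
            cache[base + i // 2] |= (code & 0xF) << (4 * (i % 2))


def caches_match(a: Store, b: Store, rows, slot_mapping, num_slots, slot_bytes) -> bool:
    """Run both stores on identically initialised caches and compare every byte."""
    cache_a = bytearray(b"\xAA" * (num_slots * slot_bytes))  # sentinel fill
    cache_b = bytearray(cache_a)
    a(cache_a, rows, slot_mapping, slot_bytes)
    b(cache_b, rows, slot_mapping, slot_bytes)
    return cache_a == cache_b


# Exercise the head sizes from the test plan, including one skipped token.
for d in (127, 128, 256):
    rows = [[(7 * i + t) % 16 for i in range(d)] for t in range(3)]
    assert caches_match(store_reference, store_candidate, rows, [2, -1, 0], 4, (d + 1) // 2)
```

The sentinel fill makes stray writes visible: if either implementation touched a slot it should have skipped, or left the odd tail nibble of D=127 in a different state, the byte comparison would fail.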

Test Result

Local static checks passed:

git diff --check main...HEAD
python3 -m py_compile \
  vllm/v1/attention/ops/triton_turboquant_store.py \
  vllm/_custom_ops.py \
  vllm/envs.py \
  tests/quantization/test_turboquant.py
uvx ruff check \
  vllm/v1/attention/ops/triton_turboquant_store.py \
  vllm/_custom_ops.py \
  vllm/envs.py \
  tests/quantization/test_turboquant.py
uvx ruff format --check \
  vllm/v1/attention/ops/triton_turboquant_store.py \
  vllm/_custom_ops.py \
  vllm/envs.py \
  tests/quantization/test_turboquant.py

I could not run the CUDA parity test locally because this machine does not have a
CUDA/Hopper environment. The PR adds a dedicated H100 CI lane to run:

pytest -v -s quantization/test_turboquant.py::TestStoreDecodeRoundTrip::test_native_fp8_v4_store_matches_triton

---
<details>
<summary> Essential Elements of an Effective PR Description Checklist </summary>

- [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR
  will resolve)".
- [x] The test plan, such as providing test command.
- [x] The test results, such as pasting the results comparison before and after, or
  e2e results.
- [x] Documentation update considered. Not needed because this is an internal
  kernel dispatch optimization with Triton fallback and no user-facing API change.
</details>

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing

sladyn98 added 2 commits June 12, 2026 18:21
Signed-off-by: Sladyn Nunes <sladyn@meta.com>
Signed-off-by: Sladyn Nunes <sladyn@meta.com>
Signed-off-by: Sladyn Nunes <sladyn@meta.com>