[Tracking issue]: TurboQuant/HIGGS Attention follow-ups #40069

@mgoin

Tracking follow-up work on the TurboQuant/HIGGS KV cache attention backend initially landed in #38479.

Backend coverage

  • Extend flash_attn_varlen_func support to FA3/FA4, not just FA2
  • Hybrid attention models (e.g. Qwen3.5, mamba+attention, interleaved SWA?)
  • MLA support (through a new attention backend?)

Accuracy

  • Long-context evals across presets (k8v4, t4nc, k3v4nc, t3nc): RULER, NIAH at 32K–1M, LongBench
  • Per-layer sensitivity sweep to inform --kv-cache-dtype-skip-layers defaults
  • Publish recommended config table (quality vs. compression vs. throughput) based on eval results
  • Add new presets as the sweeps suggest (e.g. mixed-bit, per-layer schedules)
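For the per-layer sensitivity sweep, one possible shape is a leave-one-layer-unquantized loop: score the model with every layer quantized, then re-score with a single layer skipped, and rank layers by how much quality they recover. The sketch below is only an illustration under assumptions. `evaluate` is a hypothetical callable standing in for whatever eval harness is used (RULER, NIAH, LongBench), `toy_evaluate` is made-up data for demonstration, and nothing here calls into the actual backend. How the resulting layer list maps onto the value format of --kv-cache-dtype-skip-layers is left out, since that format is not described in this issue.

```python
from typing import Callable, List, NamedTuple, Sequence


class LayerResult(NamedTuple):
    layer: int
    score: float
    gain: float  # score with this layer skipped minus the all-quantized baseline


def layer_sensitivity_sweep(
    num_layers: int,
    evaluate: Callable[[Sequence[int]], float],  # hypothetical: skipped layers -> score
) -> List[LayerResult]:
    """Rank layers by the quality recovered when only that layer is left unquantized."""
    baseline = evaluate(())  # every layer quantized
    results = []
    for layer in range(num_layers):
        score = evaluate((layer,))
        results.append(LayerResult(layer, score, score - baseline))
    # Most sensitive layers (largest gain when skipped) come first.
    results.sort(key=lambda r: r.gain, reverse=True)
    return results


def pick_skip_layers(results: List[LayerResult], budget: int) -> List[int]:
    """Keep at most `budget` layers, and only those that actually improve the score."""
    return sorted(r.layer for r in results[:budget] if r.gain > 0)


def toy_evaluate(skipped: Sequence[int]) -> float:
    """Made-up scores for illustration only; not measured results."""
    bonus = {0: 0.05, 3: 0.02}
    return 0.80 + sum(bonus.get(layer, 0) for layer in skipped)


if __name__ == "__main__":
    ranked = layer_sensitivity_sweep(num_layers=4, evaluate=toy_evaluate)
    print(pick_skip_layers(ranked, budget=2))
```

A single-layer sweep costs one eval per layer and ignores interactions between layers, so any default derived this way would still need a confirmation run with the full skip list applied.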

Feature compatibility

Features currently disabled or unverified with the TurboQuant backend, to be enabled and tested:

  • Speculative decoding / Eagle
  • KV connector / disaggregated serving (NIXL, LMCache, Mooncake)

Performance

  • CUDA/CuTe DSL kernels to replace the Triton kernels
  • Validate AMD performance
  • Revisit stream-overlap gating under CUDAGraph
  • FP8 decode path parity on Hopper

cc @vibhavagarwal5
