[Tracking issue]: TurboQuant/HIGGS Attention follow-ups

Tracking follow-up work on the TurboQuant/HIGGS KV cache attention backend initially landed in #38479.

### Backend coverage
- [x] Expand `flash_attn_varlen_func` to FA3/4, not just FA2
- [ ] Hybrid attention models (e.g. Qwen3.5, mamba+attention, interleaved SWA?)
- [ ] MLA support (through a new attention backend?)

### Accuracy
- [x] Long-context evals across presets (k8v4, t4nc, k3v4nc, t3nc): RULER, NIAH at 32K–1M, LongBench
- [ ] Per-layer sensitivity sweep to inform `--kv-cache-dtype-skip-layers` defaults
- [ ] Publish recommended config table (quality vs. compression vs. throughput) based on eval results
- [ ] Add new presets as the sweeps suggest (e.g. mixed-bit, per-layer schedules)
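
A minimal sketch of how the per-layer sensitivity sweep above could be structured, assuming a leave-one-layer-unquantized design. `evaluate` is a hypothetical callback (not a vLLM API) that runs an eval such as RULER with the given layers left unquantized and returns a score where higher is better. `rank_layer_sensitivity`, `pick_skip_layers`, and the toy penalties are illustrative only.

```python
from typing import Callable


def rank_layer_sensitivity(
    num_layers: int,
    evaluate: Callable[[frozenset], float],
) -> list:
    """Return (layer, gain) pairs, largest gain first.

    `evaluate(skip)` is a hypothetical callback: it scores the model with
    the layers in `skip` left unquantized (higher score is better).
    """
    baseline = evaluate(frozenset())  # every layer quantized
    gains = []
    for layer in range(num_layers):
        score = evaluate(frozenset({layer}))  # only this layer is skipped
        gains.append((layer, score - baseline))
    # sorted() is stable, so ties keep ascending layer order.
    return sorted(gains, key=lambda item: item[1], reverse=True)


def pick_skip_layers(ranking: list, budget: int) -> list:
    """Keep at most `budget` layers whose skip gave a positive gain."""
    return sorted(layer for layer, gain in ranking[:budget] if gain > 0)


# Toy stand-in for a real eval: each quantized layer costs a fixed,
# made-up number of points out of 100.
TOY_PENALTY = {0: 5, 1: 1, 2: 0, 3: 3}


def toy_evaluate(skip: frozenset) -> float:
    return 100 - sum(p for layer, p in TOY_PENALTY.items() if layer not in skip)


ranking = rank_layer_sensitivity(num_layers=4, evaluate=toy_evaluate)
skip_layers = pick_skip_layers(ranking, budget=2)
print(ranking)
print(skip_layers)
```

A ranking like this could inform the `--kv-cache-dtype-skip-layers` defaults, one ranking per preset. Single-layer sweeps ignore interactions between layers, so the chosen set should be re-evaluated as a whole before it is published in the config table.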

### Feature compatibility
Things currently disabled or unverified with the TurboQuant backend; enable and test:
- [ ] Speculative decoding / EAGLE
- [ ] KV connector / disaggregated serving (NIXL, LMCache, Mooncake)

### Performance
- [ ] CUDA/CuTe DSL kernels to replace the Triton kernels
- [ ] Validate AMD performance
- [ ] Revisit stream-overlap gating under CUDAGraph
- [ ] FP8 decode path parity on Hopper

cc @vibhavagarwal5

[Tracking issue]: TurboQuant/HIGGS Attention follow-ups #40069

Backend coverage

Accuracy

Feature compatibility

Performance

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[Tracking issue]: TurboQuant/HIGGS Attention follow-ups #40069

Description

Backend coverage

Accuracy

Feature compatibility

Performance

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions