Speculators v0.5.0

@dsikka released this 24 Apr 16:16
f26d8a7
Speculators v0.5.0 Release Notes

Speculators v0.5.0 adds support for the DFlash algorithm and for online training, and unifies all data generation, both online and offline, under vLLM's hidden states extraction system. The documentation has been expanded with end-to-end tutorials for all supported training workflows.

Key new features include:

  • DFlash algorithm training support
  • Full online training support
  • Both online and offline training now use vLLM's native hidden states extraction system
  • New tutorials for model serving, E2E online & offline Eagle 3 training, and E2E online DFlash training

DFlash Training Support ✨

Speculators now supports training DFlash speculative decoding draft models. Unlike Eagle 3, which generates draft tokens autoregressively across multiple forward passes, DFlash uses a block diffusion approach to generate an entire block of draft tokens in a single forward pass. This parallel drafting reduces inter-token latency compared to Eagle 3.
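The difference in draft forward passes can be illustrated with a toy counter. This is a sketch only: `ToyDraftModel`, `draft_autoregressive`, and `draft_block` are hypothetical names that are not part of Speculators or vLLM, the "model" produces placeholder integers rather than real tokens, and the block size of eight is chosen simply to match the eight positions reported in this release.

```python
class ToyDraftModel:
    """Stand-in drafter that only counts forward passes; not a real model."""

    def __init__(self):
        self.forward_passes = 0

    def forward(self, context, num_outputs):
        # One call represents one forward pass, however many tokens it emits.
        self.forward_passes += 1
        return [(sum(context) + i) % 1000 for i in range(num_outputs)]


def draft_autoregressive(model, context, k):
    """Autoregressive drafting: one token per pass, fed back as context."""
    drafted = []
    for _ in range(k):
        (token,) = model.forward(context + drafted, 1)
        drafted.append(token)
    return drafted


def draft_block(model, context, k):
    """Block drafting: all k tokens come out of a single pass."""
    return model.forward(context, k)


ar_model, block_model = ToyDraftModel(), ToyDraftModel()
ar_tokens = draft_autoregressive(ar_model, [1, 2, 3], 8)
block_tokens = draft_block(block_model, [1, 2, 3], 8)
print(ar_model.forward_passes, block_model.forward_passes)  # 8 1
```

Both paths draft eight tokens, but the autoregressive path pays for eight sequential draft passes while the block path pays for one, which is the source of the latency difference described above.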

Training support includes a new DFlash model definition, config, and associated training examples. The trainer has been updated to accept DFlash-specific arguments, and attention utilities are now shared across Eagle 3 and DFlash.

With this, a Gemma 4 DFlash speculator was released, showing the following per-position acceptance rates:

| Dataset | Position 0 | Position 1 | Position 2 | Position 3 | Position 4 | Position 5 | Position 6 | Position 7 | Acceptance Length |
|---|---|---|---|---|---|---|---|---|---|
| HumanEval | 85.8% | 72.1% | 60.3% | 50.4% | 41.8% | 34.3% | 26.9% | 19.6% | 4.91 |
| Math Reasoning | 88.7% | 76.1% | 64.8% | 54.9% | 45.5% | 36.5% | 28.8% | 21.5% | 5.17 |
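The Acceptance Length column appears to equal one plus the sum of the per-position acceptance rates. The sketch below checks that arithmetic under two assumptions the release notes do not state: that each per-position rate is the unconditional probability that the draft token at that position is accepted, and that the extra 1 is the token the verifier contributes on every step. The function name is illustrative.

```python
# Per-position acceptance rates (percent), copied from the table above.
RATES = {
    "HumanEval": [85.8, 72.1, 60.3, 50.4, 41.8, 34.3, 26.9, 19.6],
    "Math Reasoning": [88.7, 76.1, 64.8, 54.9, 45.5, 36.5, 28.8, 21.5],
}


def expected_acceptance_length(rates_percent):
    """Assumed relation: 1 verifier token + expected number of accepted drafts."""
    return 1.0 + sum(rate / 100.0 for rate in rates_percent)


for name, rates in RATES.items():
    print(f"{name}: {expected_acceptance_length(rates):.2f}")
```

Under those assumptions the script prints 4.91 for HumanEval and 5.17 for Math Reasoning, which agrees with the reported values.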

Gemma 4 DFlash achieves better inter-token latency than both Eagle 3 and a standalone FP8 quantized verifier. Combining DFlash with an FP8 quantized verifier yields even greater gains, as shown below:

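[Figure: inter-token latency comparison of Gemma 4 DFlash, Eagle 3, and an FP8 quantized verifier, including DFlash combined with the FP8 quantized verifier]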

Hidden States Extraction System Integration

As of v0.5.0, both online and offline training in Speculators use vLLM's native hidden states extraction system. This is a significant unification: previously, offline data generation used a separate Speculators-managed system, and online training was not supported at all.

vLLM's hidden states extraction system, introduced in vLLM v0.18.0, provides a native way to extract intermediate model representations during inference. It routes hidden states through vLLM's existing speculative decoding pathway via a dummy draft model, storing them in a dedicated KV cache and exporting them via a custom KV Connector API. This design reuses vLLM's existing infrastructure — including tensor parallelism, prefix caching, and paged memory management — with minimal overhead on standard inference.

Offline training has been migrated to this system. It is more performant and better integrated with vLLM, and it eliminates the risk of divergence between training and serving behavior that existed under the previous Speculators-managed data generation system.

Online training is supported for the first time in this release. Speculators can train directly on hidden states generated on the fly, which eliminates the need to pre-cache training data. This also enables hybrid training approaches that combine online and offline data.

The online training workflow:

  1. Response regeneration — regenerate target model responses
  2. prepare_data.py — tokenize and format data
  3. launch_vllm.py — launch the vLLM server
  4. train.py — extract hidden states and train
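To make the online, offline, and hybrid distinction concrete, here is a minimal conceptual sketch. It does not use the Speculators or vLLM APIs, and the interface of `train.py` is not shown in these notes. Every name here (`offline_batches`, `online_batches`, `train`, `fake_extract`) is hypothetical, and the extraction function stands in for a request to a running vLLM server.

```python
from itertools import chain


def offline_batches(cache):
    """Yield (token_ids, hidden_states) pairs that were extracted ahead of time."""
    yield from cache


def online_batches(samples, extract_hidden_states):
    """Yield pairs whose hidden states are extracted on demand, one sample at a time."""
    for token_ids in samples:
        yield token_ids, extract_hidden_states(token_ids)


def train(batches, train_step):
    """Run one training step per batch and return the number of steps taken."""
    steps = 0
    for token_ids, hidden_states in batches:
        train_step(token_ids, hidden_states)
        steps += 1
    return steps


def fake_extract(token_ids):
    # Hypothetical stand-in for fetching hidden states from a live server.
    return [float(t) for t in token_ids]


cache = [([1, 2], [1.0, 2.0])]   # pre-cached (offline) data
live = [[3, 4], [5, 6]]          # tokenized samples awaiting extraction (online)
seen = []

# Hybrid training: consume the cached pairs first, then extract the rest lazily.
steps = train(
    chain(offline_batches(cache), online_batches(live, fake_extract)),
    lambda ids, hs: seen.append((ids, hs)),
)
```

Because `online_batches` is a generator, hidden states are only produced when the training loop asks for the next batch, so nothing needs to be written to disk beforehand.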

Documentation Updates

The Speculators documentation has been refreshed to reflect the updated hidden states system and show up-to-date usage across all training flows.

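New tutorials have been added covering the main usage workflows: model serving, end-to-end online and offline Eagle 3 training, and end-to-end online DFlash training.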

Updated Examples

Training examples have been added for Eagle 3 and DFlash.

Other Updates

  • Updated to support transformers v5.6
  • Updated to support torch 2.11

Deprecations

The data generation system previously supported through Speculators v0.3.0 has been deprecated and removed as of v0.5.0. The old system required vLLM as a dependency, which has also been removed. All training flows are now supported through vLLM's hidden states extraction system.

New Contributors

Full Changelog: v0.4.0.1...v0.5.0