Speculators v0.5.0 Release Notes
Speculators v0.5.0 adds support for the DFlash algorithm, introduces online training, and unifies all data generation, both online and offline, under vLLM's hidden states extraction system. The documentation now includes end-to-end tutorials for all supported training workflows.
Key new features include:
- DFlash algorithm training support
- Full online training support
- Both online and offline training now use vLLM's native hidden states extraction system
- New tutorials for model serving, E2E online & offline Eagle 3 training, and E2E online DFlash training
DFlash Training Support ✨
Speculators now supports training DFlash speculative decoding draft models. Unlike Eagle 3, which generates draft tokens autoregressively across multiple forward passes, DFlash uses a block diffusion approach to generate an entire block of draft tokens in a single forward pass. This parallel drafting reduces inter-token latency compared to Eagle 3.
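To make the difference in drafting cost concrete, the toy sketch below counts draft-model forward passes for the two call patterns. `ToyDrafter`, `draft_autoregressive`, and `draft_block` are hypothetical names invented for this illustration. The sketch implements neither Eagle 3 nor block diffusion; it only mirrors the "one pass per token" versus "one pass per block" pattern described above.

```python
from typing import List


class ToyDrafter:
    """Hypothetical stand-in for a draft model; it only counts forward passes."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def forward(self, context: List[int], num_tokens: int) -> List[int]:
        # One call represents one forward pass, however many tokens it emits.
        self.forward_passes += 1
        start = context[-1] + 1 if context else 0
        return [start + i for i in range(num_tokens)]


def draft_autoregressive(model: ToyDrafter, context: List[int], k: int) -> List[int]:
    """Eagle 3 style call pattern: one forward pass per draft token."""
    drafted: List[int] = []
    for _ in range(k):
        drafted += model.forward(context + drafted, num_tokens=1)
    return drafted


def draft_block(model: ToyDrafter, context: List[int], k: int) -> List[int]:
    """DFlash style call pattern: the whole block in a single forward pass."""
    return model.forward(context, num_tokens=k)


ar_model, block_model = ToyDrafter(), ToyDrafter()
ar_tokens = draft_autoregressive(ar_model, [10, 11], k=8)
block_tokens = draft_block(block_model, [10, 11], k=8)
print(ar_model.forward_passes, block_model.forward_passes)  # prints "8 1"
```

Both toy drafters return the same tokens here only because the stand-in model is deterministic; real draft models will generally differ in what they propose and in how many of their tokens are accepted.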
Training support includes a new DFlash model definition, config, and associated training examples. The trainer has been updated to accept DFlash-specific arguments, and attention utilities are now shared across Eagle 3 and DFlash.
Alongside this support, a Gemma 4 DFlash speculator was released. Its per-position acceptance rates are:
| Dataset | Position 0 | Position 1 | Position 2 | Position 3 | Position 4 | Position 5 | Position 6 | Position 7 | Acceptance Length |
|---|---|---|---|---|---|---|---|---|---|
| HumanEval | 85.8% | 72.1% | 60.3% | 50.4% | 41.8% | 34.3% | 26.9% | 19.6% | 4.91 |
| Math Reasoning | 88.7% | 76.1% | 64.8% | 54.9% | 45.5% | 36.5% | 28.8% | 21.5% | 5.17 |
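The Acceptance Length column matches one plus the sum of the per-position acceptance rates. The sketch below checks that relation against the table, assuming each rate is the fraction of verification steps in which the draft token at that position is accepted, and that the verifier contributes one token per step. That interpretation is an assumption rather than something stated in the release, and `acceptance_length` is a helper name made up for this example.

```python
from typing import Dict, List

# Per-position acceptance rates (percent), copied from the table above.
RATES: Dict[str, List[float]] = {
    "HumanEval": [85.8, 72.1, 60.3, 50.4, 41.8, 34.3, 26.9, 19.6],
    "Math Reasoning": [88.7, 76.1, 64.8, 54.9, 45.5, 36.5, 28.8, 21.5],
}


def acceptance_length(rates_percent: List[float]) -> float:
    """Expected tokens per verification step: one verifier token plus accepted drafts."""
    return 1.0 + sum(rates_percent) / 100.0


for name, rates in RATES.items():
    print(f"{name}: {acceptance_length(rates):.2f}")
# HumanEval: 4.91
# Math Reasoning: 5.17
```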
Gemma 4 DFlash achieves better inter-token latency than both Eagle 3 and a standalone FP8 quantized verifier. Combining DFlash with an FP8 quantized verifier yields even greater gains.
Hidden States Extraction System Integration
As of v0.5.0, both online and offline training in Speculators use vLLM's native hidden states extraction system. This is a significant unification: previously, offline data generation used a separate Speculators-managed system, and online training was not supported at all.
vLLM's hidden states extraction system, introduced in vLLM v0.18.0, provides a native way to extract intermediate model representations during inference. It routes hidden states through vLLM's existing speculative decoding pathway via a dummy draft model, storing them in a dedicated KV cache and exporting them via a custom KV Connector API. This design reuses vLLM's existing infrastructure, including tensor parallelism, prefix caching, and paged memory management, with minimal overhead on standard inference.
Offline training has been migrated to this system. It is more performant, better integrated with vLLM, and eliminates the risk of divergence between training and serving behavior that existed with the previous Speculators-managed data generation system.
Online training is supported for the first time in this release. Speculators can train directly on hidden states generated on the fly, eliminating the need to pre-cache training data. This also enables hybrid training approaches that combine online and offline data.
The online training workflow has four steps:
- Response regeneration: regenerate target model responses
- prepare_data.py: tokenize and format data
- launch_vllm.py: launch the vLLM server
- train.py: extract hidden states and train
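The difference between pre-cached (offline) and on-the-fly (online) hidden states, and the idea of a hybrid mix, can be pictured with a small toy sketch. Everything in it is hypothetical: `offline_samples`, `online_samples`, `hybrid_samples`, and `fake_extract` are illustrative names rather than Speculators or vLLM APIs, and simple alternation is just one assumed way to combine the two sources.

```python
import itertools
from typing import Callable, Dict, Iterable, Iterator, List

Sample = Dict[str, object]


def offline_samples(cache: Iterable[Sample]) -> Iterator[Sample]:
    """Offline: hidden states were generated and stored before training."""
    yield from cache


def online_samples(
    prompts: Iterable[List[int]], extract: Callable[[List[int]], Sample]
) -> Iterator[Sample]:
    """Online: hidden states are requested per prompt while training runs."""
    for prompt in prompts:
        yield extract(prompt)


def hybrid_samples(a: Iterator[Sample], b: Iterator[Sample]) -> Iterator[Sample]:
    """Hybrid: alternate between two sources until both are exhausted."""
    for pair in itertools.zip_longest(a, b):
        for sample in pair:
            if sample is not None:
                yield sample


def fake_extract(prompt: List[int]) -> Sample:
    # Hypothetical stand-in for a request to a running vLLM server.
    return {"source": "online", "hidden_states": [float(t) for t in prompt]}


cache = [{"source": "offline", "hidden_states": [0.5, 0.25]}]
stream = hybrid_samples(
    online_samples([[1, 2], [3]], fake_extract), offline_samples(cache)
)
sources = [sample["source"] for sample in stream]
print(sources)  # ['online', 'offline', 'online']
```

Because the online source is a generator, no hidden states exist until the training loop asks for them, which is the property that removes the pre-caching step.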
Documentation Updates
The Speculators documentation has been refreshed to reflect the updated hidden states system and show up-to-date usage across all training flows.
New tutorials have been added covering the main usage workflows:
- Model serving
- E2E online & offline training with Eagle 3 models
- E2E online training with DFlash models
Updated Examples
Training examples have been added for Eagle 3 and DFlash:
- dflash_qwen3_8b_sharegpt_online_5k.sh: Online DFlash training with Qwen3-8B on ShareGPT
- eagle3_llama3_8b_ultrachat_offline_5k.sh: Offline Eagle 3 training with Llama3-8B on UltraChat
- eagle3_qwen3_8b_sharegpt_online_5k.sh: Online Eagle 3 training with Qwen3-8B on ShareGPT
Other Updates
- Updated to support transformers v5.6
- Updated to support torch 2.11
Deprecations
The data generation system previously supported through Speculators v0.3.0 has been deprecated and removed as of v0.5.0. The old system required a vLLM dependency, which has also been removed. All training flows now go through vLLM's hidden states extraction system.
New Contributors
- @shubhra made their first contribution in #337
- @benchislett made their first contribution in #346
- @surojitiitg made their first contribution in #334
Full Changelog: v0.4.0.1...v0.5.0