Speculators v0.5.0 Release Notes

Speculators v0.5.0 adds support for the DFlash algorithm and for online training, and unifies all data generation, both online and offline, under vLLM's hidden states extraction system. Documentation has been expanded with end-to-end tutorials for all supported training workflows.

Key new features include:

DFlash algorithm training support
Full online training support
Both online and offline training now use vLLM's native hidden states extraction system
New tutorials for model serving, E2E online & offline Eagle 3 training, and E2E online DFlash training

DFlash Training Support ✨

Speculators now supports training DFlash speculative decoding draft models. Unlike Eagle 3, which generates draft tokens autoregressively across multiple forward passes, DFlash uses a block diffusion approach to generate an entire block of draft tokens in a single forward pass. This parallel drafting reduces inter-token latency compared to Eagle 3.
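The difference in drafting cost can be illustrated with a toy sketch. It assumes a hypothetical `CountingDrafter` class that only counts forward passes and returns fake tokens; it is not the DFlash or Eagle 3 architecture, and none of these names come from Speculators.

```python
from typing import List


class CountingDrafter:
    """Hypothetical toy drafter: counts forward passes, returns fake tokens."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def forward(self, context: List[int], num_tokens: int) -> List[int]:
        # One call is one forward pass, however many tokens it emits.
        self.forward_passes += 1
        last = context[-1]
        return [(last + i + 1) % 100 for i in range(num_tokens)]


def draft_autoregressive(drafter: CountingDrafter, context: List[int], k: int) -> List[int]:
    """Eagle-3-style drafting: one forward pass per draft token."""
    tokens: List[int] = []
    for _ in range(k):
        tokens.append(drafter.forward(context + tokens, 1)[0])
    return tokens


def draft_block(drafter: CountingDrafter, context: List[int], k: int) -> List[int]:
    """DFlash-style drafting: the whole block comes from one forward pass."""
    return drafter.forward(context, k)


ar, blk = CountingDrafter(), CountingDrafter()
ar_tokens = draft_autoregressive(ar, [1, 2, 3], k=8)
blk_tokens = draft_block(blk, [1, 2, 3], k=8)
print(ar.forward_passes, blk.forward_passes)
```

In this toy both drafters emit the same eight tokens, but the autoregressive one pays for eight sequential passes while the block drafter pays for one. That sequential cost is what the parallel approach avoids.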

Training support includes a new DFlash model definition, config, and associated training examples. The trainer has been updated to accept DFlash-specific arguments, and attention utilities are now shared across Eagle 3 and DFlash.

Alongside this work, a Gemma 4 DFlash speculator was released. It shows the following per-position acceptance rates:

Dataset	Position 0	Position 1	Position 2	Position 3	Position 4	Position 5	Position 6	Position 7	Acceptance Length
HumanEval	85.8%	72.1%	60.3%	50.4%	41.8%	34.3%	26.9%	19.6%	4.91
Math Reasoning	88.7%	76.1%	64.8%	54.9%	45.5%	36.5%	28.8%	21.5%	5.17
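
The Acceptance Length column is consistent with the per-position rates under one common reading. This reading is an assumption here, not something the release states: each rate is the fraction of drafting rounds in which that position is accepted, and the verifier always contributes one additional token. The expected length is then 1 plus the sum of the rates. A short check:

```python
# Per-position acceptance rates (percent), copied from the table above.
rates = {
    "HumanEval": [85.8, 72.1, 60.3, 50.4, 41.8, 34.3, 26.9, 19.6],
    "Math Reasoning": [88.7, 76.1, 64.8, 54.9, 45.5, 36.5, 28.8, 21.5],
}


def acceptance_length(per_position_pct):
    """Assumed formula: one verifier token plus the expected number of accepted drafts."""
    return 1.0 + sum(p / 100.0 for p in per_position_pct)


for name, pcts in rates.items():
    print(f"{name}: {acceptance_length(pcts):.2f}")
# HumanEval: 4.91
# Math Reasoning: 5.17
```

Both computed values match the table.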

Gemma 4 DFlash achieves better inter-token latency than both Eagle 3 and a standalone FP8 quantized verifier. Combining DFlash with an FP8 quantized verifier yields even greater gains.

Hidden States Extraction System Integration

As of v0.5.0, both online and offline training in Speculators use vLLM's native hidden states extraction system. This is a significant unification: previously, offline data generation used a separate Speculators-managed system, and online training was not supported at all.

vLLM's hidden states extraction system, introduced in vLLM v0.18.0, provides a native way to extract intermediate model representations during inference. It routes hidden states through vLLM's existing speculative decoding pathway via a dummy draft model, storing them in a dedicated KV cache and exporting them via a custom KV Connector API. This design reuses vLLM's existing infrastructure — including tensor parallelism, prefix caching, and paged memory management — with minimal overhead on standard inference.

Offline training has been migrated to this system. It is more performant, better integrated with vLLM, and eliminates the risk of divergence between training and serving behavior that existed with the previous Speculators-managed data generation system.

Online training is now supported for the first time in this release. Speculators can train directly on live hidden states generated on-the-fly — eliminating the need to pre-cache training data entirely. This also enables hybrid training approaches combining online and offline data.

The online training workflow:

Response regeneration — regenerate target model responses
prepare_data.py — tokenize and format data
launch_vllm.py — launch the vLLM server
train.py — extract hidden states and train
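
The difference between offline, online, and hybrid data flows can be sketched in a few lines. This is a conceptual toy only. `extract_hidden_states` is a hypothetical stand-in for a request to the vLLM server, and none of these function names come from Speculators or vLLM.

```python
import itertools
import random
from typing import Iterable, Iterator, List

HiddenStates = List[List[float]]  # one vector per token


def extract_hidden_states(token_ids: List[int], dim: int = 4) -> HiddenStates:
    """Hypothetical stand-in for asking the serving engine for hidden states."""
    rng = random.Random(sum(token_ids))  # deterministic fake values
    return [[rng.random() for _ in range(dim)] for _ in token_ids]


def offline_batches(dataset: List[List[int]]) -> Iterator[HiddenStates]:
    """Offline: extract and cache everything first, then iterate over the cache."""
    cache = [extract_hidden_states(sample) for sample in dataset]
    yield from cache


def online_batches(dataset: List[List[int]]) -> Iterator[HiddenStates]:
    """Online: extract each sample on the fly, right before it is consumed."""
    for sample in dataset:
        yield extract_hidden_states(sample)


def train(batches: Iterable[HiddenStates]) -> int:
    """Toy trainer: a real one would compute a loss and update the draft model."""
    steps = 0
    for _hidden in batches:
        steps += 1
    return steps


cached_part = [[1, 2, 3], [4, 5]]
live_part = [[6, 7, 8, 9]]

# Hybrid: consume pre-cached samples and live samples in one training run.
hybrid_steps = train(itertools.chain(offline_batches(cached_part), online_batches(live_part)))
print(hybrid_steps)
```

The trainer sees the same kind of batch in both modes. Only the point at which hidden states are produced changes, which is why the online path needs no pre-cached training data.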

Documentation Updates

The Speculators documentation has been refreshed to reflect the updated hidden states system and show up-to-date usage across all training flows.

New tutorials have been added covering the main usage workflows:

Model serving
E2E online & offline training with Eagle3 models
E2E online training with DFlash models

Updated Examples

Training examples have been added for Eagle3 and DFlash:

dflash_qwen3_8b_sharegpt_online_5k.sh — Online DFlash training with Qwen3-8B on ShareGPT
eagle3_llama3_8b_ultrachat_offline_5k.sh — Offline Eagle3 training with Llama3-8B on UltraChat
eagle3_qwen3_8b_sharegpt_online_5k.sh — Online Eagle3 training with Qwen3-8B on ShareGPT

Other Updates

Updated to support transformers v5.6
Updated to support torch 2.11

Deprecations

The data generation system previously supported through Speculators v0.3.0 has been deprecated and removed as of v0.5.0. The old system required a vLLM dependency, which has also been removed. All training flows now go through vLLM's hidden states extraction system.

New Contributors

@shubhra made their first contribution in #337
@benchislett made their first contribution in #346
@surojitiitg made their first contribution in #334

Full Changelog: v0.4.0.1...v0.5.0
