PARD is a high-performance speculative decoding method that also enables low-cost adaptation of autoregressive draft models into parallel draft models. It offers the following advantages:
- **Low-Cost Training**: PARD adapts AR (autoregressive) draft models into parallel draft models with minimal overhead. Compared to pure AR draft models, PARD achieves an average inference speedup of 1.78×. By introducing a conditional drop-token strategy, PARD improves training efficiency by up to 3× while maintaining the same level of accuracy.
- **Generalizability**: Thanks to its target-independent design, a single PARD draft model can accelerate an entire family of target models. This contrasts with target-dependent approaches such as Medusa and EAGLE, which require retraining or tuning for each new target. As a result, PARD significantly reduces both deployment complexity and adaptation cost.
- **High Performance**: When integrated into an optimized inference framework called Transformers+, PARD delivers up to a 4.08× speedup, with LLaMA3.1 8B reaching a state-of-the-art 311.5 tokens per second. When integrated into vLLM, PARD delivers up to a 3.06× speedup, outperforming other speculative decoding methods in vLLM by 1.51×.
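The mechanism behind these speedups is the standard speculative decoding loop: the draft model proposes a block of tokens (with PARD, produced in a single parallel forward pass rather than token by token), and the target model verifies them, keeping the longest correct prefix plus one token of its own. A minimal toy sketch with stand-in models — all names and the arithmetic "models" here are illustrative, not the PARD API:

```python
# Toy sketch of one speculative decoding step with a parallel draft model.
# The "models" are trivial arithmetic rules, used only to show the
# accept/verify control flow.

def draft_parallel(prefix, k):
    """Stand-in parallel draft model: proposes k tokens at once."""
    proposals = [prefix[-1] + i + 1 for i in range(k)]
    if k > 2:
        proposals[2] = -1  # simulate a draft mistake at position 3
    return proposals

def target_next(prefix):
    """Stand-in target model: deterministic next-token rule."""
    return prefix[-1] + 1

def speculative_step(prefix, k):
    """Verify k draft tokens against the target; accept the longest
    matching prefix, then append one corrected token from the target."""
    accepted = []
    for tok in draft_parallel(prefix, k):
        if tok == target_next(prefix + accepted):
            accepted.append(tok)
        else:
            break
    # The target always contributes one more token, so every step
    # emits at least one token even if all draft tokens are rejected.
    accepted.append(target_next(prefix + accepted))
    return accepted

print(speculative_step([0], k=4))  # → [1, 2, 3]: two drafts accepted, one target token
```

The end-to-end speedup grows with the number of draft tokens accepted per step, which is why draft quality and the per-iteration draft length (`draft_k` below) matter.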

- 2026.02.06: PARD is now officially supported in vLLM!
- 2026.01.26: PARD has been accepted to ICLR'26.
- 2025.10.20: Added support for Llama 4.
- 2025.07.16: Added support for Qwen3.
- 2025.06.30: Added support for vLLM.
- The PARD results reported in the paper were obtained with vLLM v0. The table below presents results on vLLM v1, which achieve higher speedups.
- The vLLM version used is v0.16.0 (V1 engine). For EAGLE3, Llama 3.1 8B and Llama 3.3 70B use the official EAGLE3 model weights, while Qwen3 8B uses the AngelSlim/Qwen3-8B_eagle3 model weights. Qwen3 was evaluated in no-thinking mode, and the optimal `draft_k` was selected for each model and task.
| Method | Target Model | Framework | Device | HumanEval TPS | HumanEval Speedup | GSM8K TPS | GSM8K Speedup | MT-Bench TPS | MT-Bench Speedup | Average TPS | Average Speedup |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | L3.1 8B | vllm-v0.16.0 | A100 | 78.43 | 1.00 | 78.43 | 1.00 | 78.37 | 1.00 | 78.41 | 1.00 |
| EAGLE3 | L3.1 8B | vllm-v0.16.0 | A100 | 245.10 | 3.13 | 204.08 | 2.60 | 189.39 | 2.42 | 212.86 | 2.71 |
| PARD | L3.1 8B | vllm-v0.16.0 | A100 | 373.13 | 4.76 | 313.48 | 4.00 | 213.22 | 2.72 | 299.94 | 3.83 |
| Baseline | Q3 8B | vllm-v0.16.0 | A100 | 76.51 | 1.00 | 76.57 | 1.00 | 76.39 | 1.00 | 76.49 | 1.00 |
| EAGLE3 | Q3 8B | vllm-v0.16.0 | A100 | 160.51 | 2.10 | 146.63 | 1.91 | 127.06 | 1.66 | 144.74 | 1.89 |
| PARD | Q3 8B | vllm-v0.16.0 | A100 | 386.10 | 5.05 | 336.70 | 4.40 | 192.31 | 2.52 | 305.04 | 3.99 |
| Baseline | L3.3 70B | vllm-v0.16.0 | H20 | 70.08 | 1.00 | 70.92 | 1.00 | 70.97 | 1.00 | 70.66 | 1.00 |
| EAGLE3 | L3.3 70B | vllm-v0.16.0 | H20 | 251.89 | 3.59 | 208.33 | 2.94 | 187.27 | 2.64 | 215.83 | 3.06 |
| PARD | L3.3 70B | vllm-v0.16.0 | H20 | 377.36 | 5.38 | 320.51 | 4.52 | 191.57 | 2.70 | 296.48 | 4.20 |
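The speedup columns above are simply each method's TPS divided by the baseline TPS for the same target model. Recomputing PARD's HumanEval speedups from the TPS values in the table:

```python
# Speedup = method TPS / baseline TPS (HumanEval column, values from the table above).
baseline_tps = {"L3.1 8B": 78.43, "Q3 8B": 76.51, "L3.3 70B": 70.08}
pard_tps = {"L3.1 8B": 373.13, "Q3 8B": 386.10, "L3.3 70B": 377.36}

for model, tps in pard_tps.items():
    print(model, round(tps / baseline_tps[model], 2))
# → 4.76, 5.05, 5.38 — matching the table's HumanEval speedup column
```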
```shell
# ROCm base image
rocm/pytorch:rocm6.3.2_ubuntu22.04_py3.10_pytorch_release_2.5.1_preview
# CUDA base image
nvcr.io/nvidia/pytorch:25.02-py3
```

```shell
git clone https://github.com/AMD-AGI/PARD
cd PARD
pip3 install -r requirement.txt --no-build-isolation
```
| Model Series | Model Name | Download |
|---|---|---|
| llama3 | PARD-Llama-3.2-1B | 🤗 HuggingFace |
| llama4 | PARD-Llama-4-1B | 🤗 HuggingFace |
| DSR Qwen | PARD-DeepSeek-R1-Distill-Qwen-1.5B | 🤗 HuggingFace |
| Qwen | PARD-Qwen2.5-0.5B | 🤗 HuggingFace |
| Qwen3 | PARD-Qwen3-0.6B | 🤗 HuggingFace |
```shell
python3 -m pard.infer -c config/eval/llama3_eval.yaml
python3 -m pard.infer -c config/eval/dsrq_eval.yaml
python3 -m pard.infer -c config/eval/qwen_eval.yaml
```
- `-k`, `--draft_k` (default: 12): Number of draft tokens to generate in each speculative decoding iteration. Setting this to 0 disables speculative decoding and runs the baseline method instead.
- `--tokens` (default: 512): Maximum number of tokens to generate during inference.
- `-d`, `--draft` (default: `qwen_0.5b_pard`): Name or path of the draft model.
- `-t`, `--target` (default: `qwen_2.5_7b`): Name or path of the target model.
- `-b`, `--benchmark` (default: `humaneval`): Benchmark dataset to use for evaluation. Choices include `humaneval`, `gsm8k`, and `math500`.
- `-ms`, `--model_serie` (default: None): Model series of the target model. Choices include `llama3`, `qwen`, `r1`, and `None`. When set to `None`, the series is automatically inferred from the target model's name.
- `--para` (flag; default: False): Enables the parallel draft model mode. When not set, an autoregressive (AR) draft model is used instead.
- `--nc` (flag; default: False): Disables torch compile.
- `--maxtune` (flag; default: False): Enables maxtune for the target model.
- `--max_cache_len` (default: None): Maximum cache length for the model. If not provided, it defaults to the value of `--tokens`.
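As an illustration, the options above can be combined in a single invocation. The flag values here are only an example (the config file is one of those shipped in `config/eval/`):

```shell
# Evaluate the parallel draft model (--para) on GSM8K with 12 draft tokens per iteration
python3 -m pard.infer -c config/eval/llama3_eval.yaml -b gsm8k -k 12 --para
```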
PARD has already been integrated into vLLM. Official example: Document
```shell
python3 -m pard.train -c config/train/example_qwen.yaml
```
```bibtex
@article{an2025pard,
  title={PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation},
  author={An, Zihao and Bai, Huajun and Liu, Ziqiong and Li, Dong and Barsoum, Emad},
  journal={arXiv preprint arXiv:2504.18583},
  year={2025}
}
```
