PARD is a high-performance speculative decoding method that also enables low-cost adaptation of autoregressive draft models into parallel draft models. It offers the following advantages:
- **Low-Cost Training**: PARD adapts AR (autoregressive) draft models into parallel draft models with minimal overhead. Compared to pure AR draft models, PARD achieves an average inference speedup of 1.78×. By introducing a conditional drop-token strategy, PARD improves training efficiency by up to 3× while maintaining the same level of accuracy.
- **Generalizability**: Thanks to its target-independent design, a single PARD draft model can accelerate an entire family of target models. This contrasts with target-dependent approaches such as Medusa and EAGLE, which require retraining or tuning for each new target. As a result, PARD significantly reduces both deployment complexity and adaptation cost.
- **High Performance**: When integrated into an optimized inference framework called Transformers+, PARD delivers up to a 4.08× speedup, with LLaMA3.1 8B reaching a state-of-the-art 311.5 tokens per second. When integrated into vLLM, PARD delivers up to a 3.06× speedup, outperforming other speculative decoding methods in vLLM by 1.51×.
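The mechanism behind these speedups is the standard speculative decoding loop: the draft model proposes a block of tokens (with PARD, produced in a single parallel forward pass rather than token by token), and the target model verifies them, keeping the longest correct prefix plus one token of its own. A minimal toy sketch with stand-in models — all names and the arithmetic "models" here are illustrative, not the PARD API:

```python
# Toy sketch of one speculative decoding step with a parallel draft model.
# The "models" are trivial arithmetic rules, used only to show the
# accept/verify control flow.

def draft_parallel(prefix, k):
    """Stand-in parallel draft model: proposes k tokens at once."""
    proposals = [prefix[-1] + i + 1 for i in range(k)]
    if k > 2:
        proposals[2] = -1  # simulate a draft mistake at position 3
    return proposals

def target_next(prefix):
    """Stand-in target model: deterministic next-token rule."""
    return prefix[-1] + 1

def speculative_step(prefix, k):
    """Verify k draft tokens against the target; accept the longest
    matching prefix, then append one corrected token from the target."""
    accepted = []
    for tok in draft_parallel(prefix, k):
        if tok == target_next(prefix + accepted):
            accepted.append(tok)
        else:
            break
    # The target always contributes one more token, so every step
    # emits at least one token even if all draft tokens are rejected.
    accepted.append(target_next(prefix + accepted))
    return accepted

print(speculative_step([0], k=4))  # → [1, 2, 3]: two drafts accepted, one target token
```

The end-to-end speedup grows with the number of draft tokens accepted per step, which is why draft quality and the per-iteration draft length (`draft_k` below) matter.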

- 2026.02.06: PARD is now officially supported in vLLM!
- 2026.01.26: PARD has been accepted to ICLR'26.
- 2025.10.20: Added support for Llama 4.
- 2025.07.16: Added support for Qwen3.
- 2025.06.30: Added support for vLLM.
- The PARD results reported in the paper were obtained with vLLM v0. The table below presents results on vLLM v1, which achieve higher speedups.
- The vLLM version used is v0.16.0 (V1 engine). For EAGLE3, Llama 3.1 8B and Llama 3.3 70B use the official EAGLE3 model weights, while Qwen3 8B uses the AngelSlim/Qwen3-8B_eagle3 model weights. Qwen3 was evaluated in no-thinking mode, and the optimal `draft_k` was selected for each model and task.
| Method | Target Model | Framework | Device | HumanEval TPS | HumanEval Speedup | GSM8K TPS | GSM8K Speedup | MT-Bench TPS | MT-Bench Speedup | Average TPS | Average Speedup |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | L3.1 8B | vllm-v0.16.0 | A100 | 78.43 | 1.00 | 78.43 | 1.00 | 78.37 | 1.00 | 78.41 | 1.00 |
| EAGLE3 | L3.1 8B | vllm-v0.16.0 | A100 | 245.10 | 3.13 | 204.08 | 2.60 | 189.39 | 2.42 | 212.86 | 2.71 |
| PARD | L3.1 8B | vllm-v0.16.0 | A100 | 373.13 | 4.76 | 313.48 | 4.00 | 213.22 | 2.72 | 299.94 | 3.83 |
| Baseline | Q3 8B | vllm-v0.16.0 | A100 | 76.51 | 1.00 | 76.57 | 1.00 | 76.39 | 1.00 | 76.49 | 1.00 |
| EAGLE3 | Q3 8B | vllm-v0.16.0 | A100 | 160.51 | 2.10 | 146.63 | 1.91 | 127.06 | 1.66 | 144.74 | 1.89 |
| PARD | Q3 8B | vllm-v0.16.0 | A100 | 386.10 | 5.05 | 336.70 | 4.40 | 192.31 | 2.52 | 305.04 | 3.99 |
| Baseline | L3.3 70B | vllm-v0.16.0 | H20 | 70.08 | 1.00 | 70.92 | 1.00 | 70.97 | 1.00 | 70.66 | 1.00 |
| EAGLE3 | L3.3 70B | vllm-v0.16.0 | H20 | 251.89 | 3.59 | 208.33 | 2.94 | 187.27 | 2.64 | 215.83 | 3.06 |
| PARD | L3.3 70B | vllm-v0.16.0 | H20 | 377.36 | 5.38 | 320.51 | 4.52 | 191.57 | 2.70 | 296.48 | 4.20 |
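The speedup columns above are simply each method's TPS divided by the baseline TPS for the same target model. Recomputing PARD's HumanEval speedups from the TPS values in the table:

```python
# Speedup = method TPS / baseline TPS (HumanEval column, values from the table above).
baseline_tps = {"L3.1 8B": 78.43, "Q3 8B": 76.51, "L3.3 70B": 70.08}
pard_tps = {"L3.1 8B": 373.13, "Q3 8B": 386.10, "L3.3 70B": 377.36}

for model, tps in pard_tps.items():
    print(model, round(tps / baseline_tps[model], 2))
# → 4.76, 5.05, 5.38 — matching the table's HumanEval speedup column
```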
```shell
# ROCm base image
rocm/pytorch:rocm6.3.2_ubuntu22.04_py3.10_pytorch_release_2.5.1_preview
# CUDA base image
nvcr.io/nvidia/pytorch:25.02-py3
```

```shell
git clone https://github.com/AMD-AGI/PARD
cd PARD
pip3 install -r requirement.txt --no-build-isolation
```
| Model Series | Model Name | Download |
|---|---|---|
| llama3 | PARD-Llama-3.2-1B | 🤗 HuggingFace |
| llama4 | PARD-Llama-4-1B | 🤗 HuggingFace |
| DSR Qwen | PARD-DeepSeek-R1-Distill-Qwen-1.5B | 🤗 HuggingFace |
| Qwen | PARD-Qwen2.5-0.5B | 🤗 HuggingFace |
| Qwen3 | PARD-Qwen3-0.6B | 🤗 HuggingFace |
```shell
python3 -m pard.infer -c config/eval/llama3_eval.yaml
python3 -m pard.infer -c config/eval/dsrq_eval.yaml
python3 -m pard.infer -c config/eval/qwen_eval.yaml
```
- `-k`, `--draft_k` (default: 12): Number of draft tokens to generate in each speculative decoding iteration. Setting this to 0 disables speculative decoding and runs the baseline method instead.
- `--tokens` (default: 512): Maximum number of tokens to generate during inference.
- `-d`, `--draft` (default: `qwen_0.5b_pard`): Name or path of the draft model.
- `-t`, `--target` (default: `qwen_2.5_7b`): Name or path of the target model.
- `-b`, `--benchmark` (default: `humaneval`): Benchmark dataset to use for evaluation. Choices include `humaneval`, `gsm8k`, and `math500`.
- `-ms`, `--model_serie` (default: None): Model series of the target model. Choices include `llama3`, `qwen`, `r1`, and `None`. When set to `None`, the series is automatically inferred from the target model's name.
- `--para` (flag; default: False): Enables the parallel draft model mode. When not set, an autoregressive (AR) draft model is used instead.
- `--nc` (flag; default: False): Disables torch compile.
- `--maxtune` (flag; default: False): Enables maxtune for the target model.
- `--max_cache_len` (default: None): Maximum cache length for the model. If not provided, it defaults to the value of `--tokens`.
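As an illustration, the options above can be combined in a single invocation. The flag values here are only an example (the config file is one of those shipped in `config/eval/`):

```shell
# Evaluate the parallel draft model (--para) on GSM8K with 12 draft tokens per iteration
python3 -m pard.infer -c config/eval/llama3_eval.yaml -b gsm8k -k 12 --para
```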
PARD has already been integrated into vLLM. Official example: Document
```shell
python3 -m pard.train -c config/train/example_qwen.yaml
```
```bibtex
@article{an2025pard,
  title={PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation},
  author={An, Zihao and Bai, Huajun and Liu, Ziqiong and Li, Dong and Barsoum, Emad},
  journal={arXiv preprint arXiv:2504.18583},
  year={2025}
}
```
