[Speculative Decoding] Add DFlash speculators config parsing #38300
Code Review
This pull request implements DFlash speculative decoding for Qwen3 models in the vLLM V1 engine. Key additions include the DFlashProposer, a specialized Qwen3 DFlash model with fused KV precomputation, and a Triton kernel for efficient input preparation. The changes also extend the attention selector to support non-causal attention required by DFlash. Feedback was provided regarding the _raise_if_multimodal override in the proposer, which currently enables an untested code path; it is recommended to remove this override to maintain stability for multimodal inputs.
from vllm.config import SpeculativeConfig
from vllm.distributed import cleanup_dist_env_and_memory

MODEL_PATH = "shanjiaz/speculators-dflash-format"
Could you use this model instead? https://huggingface.co/nm-testing/dflash-qwen3-8b-speculators Thanks!
Thank you! @rahul-tuli please help review
Hi @ZhanqiuHu, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
rahul-tuli
left a comment
LGTM! pending two nits
EXPECTED_GSM8K_ACCURACY = 0.885
ACCURACY_RTOL = 0.03
EXPECTED_ACCEPTANCE_LEN = 1.84
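For context, a relative-tolerance check over these constants could look like the sketch below. It is an assumption about how the test consumes them, not the PR's actual assertion, and `accuracy_within_rtol` is a hypothetical helper name.

```python
EXPECTED_GSM8K_ACCURACY = 0.885
ACCURACY_RTOL = 0.03


def accuracy_within_rtol(
    measured: float,
    expected: float = EXPECTED_GSM8K_ACCURACY,
    rtol: float = ACCURACY_RTOL,
) -> bool:
    """Hypothetical helper: True if `measured` is within `rtol` (relative) of `expected`."""
    return abs(measured - expected) <= rtol * expected


# Accepted window implied by a symmetric relative tolerance (an assumption).
lower_bound = EXPECTED_GSM8K_ACCURACY * (1 - ACCURACY_RTOL)
upper_bound = EXPECTED_GSM8K_ACCURACY * (1 + ACCURACY_RTOL)

print(f"accepted GSM8K accuracy window: [{lower_bound:.4f}, {upper_bound:.4f}]")
```

Under this reading, a measured GSM8K accuracy between roughly 0.858 and 0.912 would pass.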
Is this a dummy checkpoint? That score seems really low for a DFlash speculator
This checkpoint was trained with speculators using a different setup from the original DFlash Qwen3-8B config. For example, it uses only 3 layers instead of 5. I confirmed the expected acceptance length with @shanjiaz.
@shanjiaz is there a reason the accuracy is so low? 50% on the first token indicates either a very short/poor training run, or a problem in the code. What's going on here?
@ZhanqiuHu could you run the DFlash reference checkpoint on the same test to get a baseline to compare against? I'm shocked it would be so low, even with fewer params.
@benchislett This is a relatively short run with limited data only used for testing. We'll replace this with a better checkpoint when we have the time/resources to produce a better one.
@benchislett z-lab/Qwen3-8B-DFlash-b16, 5 layers, GSM8K-5shot:
Accuracy: 0.886
AL: 3.70
Per-position: 0.756, 0.584, 0.440, 0.320, 0.227, 0.153, 0.102, 0.066
OK, acceptable for now, but please create a GitHub issue to track this and update the test when you have a better checkpoint.
Many specdec bugs only manifest on later predicted tokens, especially with parallel drafting. Coverage of those issues is weaker when the test model has a very low acceptance rate, since acceptance falls off quickly anyway.
Hey @benchislett, we have trained a new dflash model that gets better acceptance rate:
SpecDecoding metrics: Mean acceptance length: 3.42, Accepted throughput: 4858.14 tokens/s, Drafted throughput: 16030.11 tokens/s, Accepted: 59735 tokens, Drafted: 197104 tokens, Per-position acceptance rate: 0.794, 0.607, 0.424, 0.277, 0.166, 0.090, 0.046, 0.021, Avg Draft acceptance rate: 30.3%
Let us know : )
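As a sanity check, the figures in that metrics line can be cross-checked against each other. The sketch below assumes that mean acceptance length counts the one token the target model emits on every step plus the accepted draft tokens, and that each step drafts one token per reported position (8 here). Both assumptions are inferred from the log line, not taken from the PR's code.

```python
# Values copied from the SpecDecoding metrics line above.
per_position_rates = [0.794, 0.607, 0.424, 0.277, 0.166, 0.090, 0.046, 0.021]
accepted_tokens = 59735
drafted_tokens = 197104

# Assumption: one drafted token per position per step.
num_spec_tokens = len(per_position_rates)
num_drafts = drafted_tokens / num_spec_tokens

# Assumption: each step yields 1 target token plus the accepted draft tokens.
mean_al_from_rates = 1 + sum(per_position_rates)
mean_al_from_counts = 1 + accepted_tokens / num_drafts

# Fraction of all drafted tokens that were accepted.
avg_draft_acceptance_rate = accepted_tokens / drafted_tokens

print(f"mean acceptance length (from per-position rates): {mean_al_from_rates:.2f}")
print(f"mean acceptance length (from token counts):       {mean_al_from_counts:.2f}")
print(f"average draft acceptance rate:                    {avg_draft_acceptance_rate:.1%}")
```

Both ways of computing the mean acceptance length land within rounding of the reported 3.42, and the token counts reproduce the 30.3% average draft acceptance rate.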
Signed-off-by: Zhanqiu Hu <zhu@redhat.com>
Summary
--speculative-config to override auto-detected values (config.py)
qwen3_dflash.py weight loading: d2t/t2d/verifier handling (similar to Eagle3 patterns)

Test Results (shanjiaz/speculators-dflash-format, Qwen3-8B target)
GSM8K Correctness (1319 questions, 5-shot, batched)
Magpie Acceptance Rates (200 prompts, batch-size-1)

@shanjiaz @fynnsu Ready for review. Needs confirmation on qwen3_dflash.py changes.
Magpie validation script