Issue encountered
There are currently no integration tests for the vLLM backend, which means the following often happens:
- Bumping `vllm` to a new version breaks the `lighteval` API
- Bumping `lighteval` introduces unintended regressions due to refactors or changes to the public API
In addition, many prompt functions are not unit tested, so their expected inputs/outputs are only verified once users report issues in downstream evaluation metrics.
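To illustrate the kind of test that would catch this earlier, here is a minimal, self-contained sketch. The prompt function below is a stand-in, not lighteval's actual API; only the pattern (fixed input row, pinned expected output) is the point:

```python
# Self-contained sketch of a prompt-function unit test. `mcq_prompt` is a
# hypothetical stand-in for a real lighteval prompt function.
def mcq_prompt(line: dict) -> dict:
    """Toy prompt function: format a multiple-choice row into a query + choices."""
    query = f"Question: {line['question']}\nAnswer:"
    return {
        "query": query,
        "choices": [f" {c}" for c in line["choices"]],
        "gold_index": line["answer"],
    }


def test_mcq_prompt_is_stable():
    line = {"question": "2 + 2 = ?", "choices": ["3", "4"], "answer": 1}
    doc = mcq_prompt(line)
    # Pin the exact strings so silent prompt changes show up as test failures.
    assert doc["query"] == "Question: 2 + 2 = ?\nAnswer:"
    assert doc["choices"] == [" 3", " 4"]
    assert doc["gold_index"] == 1
```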
Solution/Feature
All of the above points can be addressed by having end-to-end regression tests that run on multiple GPUs to cover:
- Regressions with data parallelism (DP)
- Regressions with tensor parallelism (TP)
- Regressions on benchmark datasets, metrics, etc. (a rough sketch of such a test is given after this list)
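To make this concrete, here is a rough sketch of what such a multi-GPU regression test could look like. The CLI flags, model arguments, task spec, metric name, reference score, and results-file layout are all assumptions and would need to be aligned with the actual lighteval interface:

```python
# Rough sketch of an end-to-end DP/TP regression test driving the lighteval CLI.
# Everything marked "assumption" below needs to be checked against the real API.
import json
import subprocess

import pytest
import torch

# Hypothetical metric name and reference value, recorded once on a known-good
# lighteval/vllm pair.
REFERENCE = {"extractive_match": 0.41}
TOLERANCE = 0.02  # tolerate small run-to-run variance


@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="needs at least 2 GPUs")
@pytest.mark.parametrize(
    "model_args",
    [
        "model_name=Qwen/Qwen3-0.6B,data_parallel_size=2",    # DP regression
        "model_name=Qwen/Qwen3-0.6B,tensor_parallel_size=2",  # TP regression
    ],
)
def test_vllm_backend_end_to_end(model_args, tmp_path):
    """Run a small eval through the vLLM backend and compare against stored scores."""
    subprocess.run(
        [
            "lighteval", "vllm", model_args,
            "lighteval|gsm8k|0|0",  # illustrative task spec
            "--output-dir", str(tmp_path),
        ],
        check=True,  # any crash (e.g. a broken vllm API) fails the test immediately
    )
    # Assumption: lighteval writes a results_*.json file with per-task metrics
    # under a top-level "results" key; adjust to whatever it actually writes.
    results_file = next(tmp_path.rglob("results_*.json"))
    results = json.loads(results_file.read_text())
    for task_key, metrics in results["results"].items():
        for metric_name, score in metrics.items():
            expected = REFERENCE.get(metric_name)
            if expected is None:
                continue
            assert abs(score - expected) <= TOLERANCE, (
                f"Regression on {task_key}/{metric_name}: {score} vs {expected}"
            )
```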
For post-training benchmarks, I suggest we use the `Qwen/Qwen3-0.6B` model as the reference since it is fast to run and supports hybrid reasoning via `/think` and `/no_think` system prompts.
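For reference, here is a small sketch of how the two modes could be exercised when building chat prompts; the system prompt wording is illustrative, only the `/think` / `/no_think` switches matter:

```python
# Sketch of toggling Qwen3's hybrid reasoning via the soft switches in the
# system prompt. The surrounding wording is illustrative; the model reads the
# /think or /no_think tag from the prompt at generation time.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

for mode in ("/think", "/no_think"):
    messages = [
        {"role": "system", "content": f"You are a helpful assistant. {mode}"},
        {"role": "user", "content": "What is 17 * 24?"},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    print(f"--- {mode} ---\n{prompt}")
```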
I've created a table here for the most critical evals: https://docs.google.com/spreadsheets/d/1He7zlSBNv2EiQ9hwpMhGNFzFG1A8E7XvPG3d5yvcBNA/edit?usp=sharing
I'll let @loubnabnl @hynky1999 @guipenedo add the corresponding core evals for pre-training.
The basic idea is to run these core evals on a daily basis (with both DP and TP) so that regressions can be caught before end users are affected.
Also note that, to catch potential issues with future `vllm`, `transformers`, etc. releases, we should add a separate set of integration tests against the `main` branch of those projects. Here is how we do it in TRL: https://github.com/huggingface/trl/blob/main/.github/workflows/tests_latest.yml
Possible alternatives
N/A