[FT] Add regression tests to improve library stability with vLLM backend #872

@lewtun

Description

Issue encountered

There are currently no integration tests for the vLLM backend, which means the following often happens:

  1. Bumping vllm to a new version breaks the lighteval API
  2. Bumping lighteval introduces unintended regressions due to refactors or changes to the public API

In addition, many prompt functions are not unit tested, so their expected inputs and outputs are only verified when users report issues in the downstream evaluation metrics.
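To make that second point concrete, here is a minimal sketch of the kind of unit test this would cover. The prompt function below is a hypothetical stand-in rather than an actual lighteval function, and the `Doc` import path and field names should be treated as assumptions to be checked against the current API:

```python
# Minimal sketch of a prompt-function unit test.
# NOTE: the `Doc` import path and field names are assumptions; adjust to the
# current lighteval API if they have moved.
from lighteval.tasks.requests import Doc


def gsm8k_style_prompt(line: dict, task_name: str = "") -> Doc:
    """Hypothetical stand-in for a real lighteval prompt function."""
    return Doc(
        task_name=task_name,
        query=f"Question: {line['question']}\nAnswer:",
        choices=[line["answer"]],
        gold_index=0,
    )


def test_prompt_function_shapes_doc_as_expected():
    line = {"question": "What is 2 + 2?", "answer": "4"}
    doc = gsm8k_style_prompt(line, task_name="gsm8k")

    # The regression being guarded against: silent changes to the prompt
    # template or to the gold/choices layout that only surface downstream
    # as bad metric values.
    assert doc.query == "Question: What is 2 + 2?\nAnswer:"
    assert doc.choices == ["4"]
    assert doc.gold_index == 0
```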

Solution/Feature

All of the above points can be addressed by having end-to-end regression tests that run on multiple GPUs (see the sketch after this list) to test:

  • Regressions on DP
  • Regressions on TP
  • Regressions on benchmark datasets, metrics, etc.
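As a starting point, here is a rough sketch of what one of these end-to-end tests could look like on a 2-GPU runner. The exact `lighteval vllm` CLI arguments, task string, metric name, results layout, and reference score below are all placeholders/assumptions to be pinned against a known-good run, not the verified current interface:

```python
# Sketch of an end-to-end DP/TP regression test (assumes a 2-GPU runner).
# NOTE: CLI argument names (model_name=, --output-dir), the task string format
# and the results JSON layout change across lighteval versions; pin them for
# the version under test.
import json
import subprocess
from pathlib import Path

import pytest

REFERENCE_SCORES = {"gsm8k": 0.40}  # placeholder -- pin from a known-good run
TOLERANCE = 0.02  # allow small run-to-run noise


def load_score(output_dir: Path, task: str = "lighteval:gsm8k:0",
               metric: str = "extractive_match") -> float:
    """Hypothetical helper: read an aggregated score from lighteval's JSON results.
    File layout and key names are version-dependent; treat this as a placeholder."""
    results_file = next(output_dir.rglob("results_*.json"))
    results = json.loads(results_file.read_text())
    return results["results"][task][metric]


@pytest.mark.parametrize(
    "parallelism_args",
    [
        pytest.param("data_parallel_size=2", id="dp2"),
        pytest.param("tensor_parallel_size=2", id="tp2"),
    ],
)
def test_gsm8k_regression(tmp_path: Path, parallelism_args: str):
    model_args = f"model_name=Qwen/Qwen3-0.6B,{parallelism_args}"
    cmd = [
        "lighteval", "vllm", model_args, "lighteval|gsm8k|0|0",
        "--output-dir", str(tmp_path),
    ]
    subprocess.run(cmd, check=True)

    score = load_score(tmp_path)
    assert score == pytest.approx(REFERENCE_SCORES["gsm8k"], abs=TOLERANCE)
```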

For post-training benchmarks, I suggest we use the Qwen/Qwen3-0.6B model as the reference since it is fast to run and supports hybrid reasoning via /think and /no_think system prompts.
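For reference, here is a minimal sketch of exercising both reasoning modes with vLLM directly. Whether we toggle the mode via the system prompt (as below) or via chat-template kwargs is an open question; the snippet only illustrates the soft switch:

```python
# Sketch: run Qwen/Qwen3-0.6B in both reasoning modes via vLLM's chat API.
# How lighteval ultimately wires the /think and /no_think toggle is an open
# design question; the system-prompt approach here is one option.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")
params = SamplingParams(temperature=0.6, max_tokens=1024)

question = {"role": "user", "content": "What is 17 * 23?"}

for mode in ("/think", "/no_think"):
    messages = [{"role": "system", "content": mode}, question]
    output = llm.chat(messages, params)
    print(mode, "->", output[0].outputs[0].text[:200])
```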

I've created a table here for the most critical evals: https://docs.google.com/spreadsheets/d/1He7zlSBNv2EiQ9hwpMhGNFzFG1A8E7XvPG3d5yvcBNA/edit?usp=sharing

I'll let @loubnabnl @hynky1999 @guipenedo add the corresponding core evals for pre-training.

The basic idea is to run these core evals on a daily basis (with both DP and TP) so that regressions can be caught before end users are affected.
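The scheduling itself would live in a GitHub Actions workflow with a cron trigger (similar to the TRL workflow linked below); on the Python side we could simply tag the expensive tests with a marker that the scheduled job selects. A possible convention (the marker name and invocations are just suggestions, not an existing lighteval setup):

```python
# conftest.py -- register a marker for the scheduled multi-GPU regression suite.
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "nightly: slow multi-GPU regression tests run on a schedule"
    )


# tests/test_vllm_regressions.py -- tag the expensive end-to-end tests.
import pytest


@pytest.mark.nightly
def test_gsm8k_regression_tp2():
    ...


# Scheduled job:  pytest -m nightly tests/
# Regular CI:     pytest -m "not nightly" tests/
```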

Also note that, to catch potential issues with future vllm, transformers, etc. releases, we should add a separate set of integration tests against the main branches of those projects. Here is how we do it in TRL: https://github.com/huggingface/trl/blob/main/.github/workflows/tests_latest.yml

Possible alternatives

N/A
