Issue encountered
There are currently no integration tests for the vLLM backend, which means the following often happens:
- Bumping `vllm` to a new version breaks the `lighteval` API
- Bumping `lighteval` introduces unintended regressions due to refactors or changes to the public API
In addition, many prompt functions are not unit tested, so their expected inputs/outputs are only verified once users report issues in downstream evaluation metrics.
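To illustrate the kind of test that would catch this earlier, here is a minimal, self-contained sketch. The prompt function below is a stand-in, not lighteval's actual API; only the pattern (fixed input row, pinned expected output) is the point:

```python
# Self-contained sketch of a prompt-function unit test. `mcq_prompt` is a
# hypothetical stand-in for a real lighteval prompt function.
def mcq_prompt(line: dict) -> dict:
    """Toy prompt function: format a multiple-choice row into a query + choices."""
    query = f"Question: {line['question']}\nAnswer:"
    return {
        "query": query,
        "choices": [f" {c}" for c in line["choices"]],
        "gold_index": line["answer"],
    }


def test_mcq_prompt_is_stable():
    line = {"question": "2 + 2 = ?", "choices": ["3", "4"], "answer": 1}
    doc = mcq_prompt(line)
    # Pin the exact strings so silent prompt changes show up as test failures.
    assert doc["query"] == "Question: 2 + 2 = ?\nAnswer:"
    assert doc["choices"] == [" 3", " 4"]
    assert doc["gold_index"] == 1
```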
Solution/Feature
All of the above points can be addressed by having end-to-end regression tests that run on multiple GPUs to cover:
- Regressions with data parallelism (DP)
- Regressions with tensor parallelism (TP)
- Regressions on benchmark datasets, metrics, etc. (a rough sketch of such a test is given after this list)
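To make this concrete, here is a rough sketch of what such a multi-GPU regression test could look like. The CLI flags, model arguments, task spec, metric name, reference score, and results-file layout are all assumptions and would need to be aligned with the actual lighteval interface:

```python
# Rough sketch of an end-to-end DP/TP regression test driving the lighteval CLI.
# Everything marked "assumption" below needs to be checked against the real API.
import json
import subprocess

import pytest
import torch

# Hypothetical metric name and reference value, recorded once on a known-good
# lighteval/vllm pair.
REFERENCE = {"extractive_match": 0.41}
TOLERANCE = 0.02  # tolerate small run-to-run variance


@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="needs at least 2 GPUs")
@pytest.mark.parametrize(
    "model_args",
    [
        "model_name=Qwen/Qwen3-0.6B,data_parallel_size=2",    # DP regression
        "model_name=Qwen/Qwen3-0.6B,tensor_parallel_size=2",  # TP regression
    ],
)
def test_vllm_backend_end_to_end(model_args, tmp_path):
    """Run a small eval through the vLLM backend and compare against stored scores."""
    subprocess.run(
        [
            "lighteval", "vllm", model_args,
            "lighteval|gsm8k|0|0",  # illustrative task spec
            "--output-dir", str(tmp_path),
        ],
        check=True,  # any crash (e.g. a broken vllm API) fails the test immediately
    )
    # Assumption: lighteval writes a results_*.json file with per-task metrics
    # under a top-level "results" key; adjust to whatever it actually writes.
    results_file = next(tmp_path.rglob("results_*.json"))
    results = json.loads(results_file.read_text())
    for task_key, metrics in results["results"].items():
        for metric_name, score in metrics.items():
            expected = REFERENCE.get(metric_name)
            if expected is None:
                continue
            assert abs(score - expected) <= TOLERANCE, (
                f"Regression on {task_key}/{metric_name}: {score} vs {expected}"
            )
```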
For post-training benchmarks, I suggest we use the `Qwen/Qwen3-0.6B` model as the reference since it is fast to run and supports hybrid reasoning via `/think` and `/no_think` system prompts.
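For reference, here is a small sketch of how the two modes could be exercised when building chat prompts; the system prompt wording is illustrative, only the `/think` / `/no_think` switches matter:

```python
# Sketch of toggling Qwen3's hybrid reasoning via the soft switches in the
# system prompt. The surrounding wording is illustrative; the model reads the
# /think or /no_think tag from the prompt at generation time.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

for mode in ("/think", "/no_think"):
    messages = [
        {"role": "system", "content": f"You are a helpful assistant. {mode}"},
        {"role": "user", "content": "What is 17 * 24?"},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    print(f"--- {mode} ---\n{prompt}")
```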
I've created a table here for the most critical evals: https://docs.google.com/spreadsheets/d/1He7zlSBNv2EiQ9hwpMhGNFzFG1A8E7XvPG3d5yvcBNA/edit?usp=sharing
I'll let @loubnabnl @hynky1999 @guipenedo add the corresponding core evals for pre-training.
The basic idea is to run these core evals on a daily basis (with both DP and TP) so that regressions can be caught before end users are affected.
Also note that, to catch potential issues with future `vllm`, `transformers`, etc. releases, we should add a separate set of integration tests against the `main` branch of those projects. Here is how we do it in TRL: https://github.com/huggingface/trl/blob/main/.github/workflows/tests_latest.yml
Possible alternatives
N/A