
Added post processing (for reasoning tokens) to pipeline #882


Merged: 19 commits into main, Aug 4, 2025

Conversation

@clefourrier (Member) commented Jul 25, 2025

Fix #869
Tested on Qwen3-0.6B and AIME25: when you select --remove-reasoning-tags, it indeed removes them before sending the answer to the metric (but the model is quite verbose, so many outputs are unchanged because the token never appears).
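For context on what the flag does: conceptually, the post-processing just strips tag-delimited reasoning spans from the generation before it reaches the metric. Here is a minimal sketch of the idea (the helper name, signature, and default tags are illustrative, not lighteval's actual API; Magistral-style tags are noted as an alternative pair):

```python
import re

def remove_reasoning_tags(text: str, pairs=(("<think>", "</think>"),)) -> str:
    """Strip reasoning blocks delimited by the given (start, end) tag pairs.

    Hypothetical sketch: tag pairs are configurable because models differ,
    e.g. Qwen3 emits <think>...</think> while Magistral uses [THINK]...[/THINK].
    """
    for start, end in pairs:
        # Drop each complete start...end block, tags included; non-greedy so
        # multiple blocks in one output are handled independently.
        pattern = re.escape(start) + r".*?" + re.escape(end)
        text = re.sub(pattern, "", text, flags=re.DOTALL)
    return text.strip()
```

With something like this in place, an output such as `<think>...</think>The answer is 42.` is scored on `The answer is 42.` alone, while verbose outputs that never emit the closing tag pass through unchanged, which matches the behaviour described above.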

Edit: updated todos based on observation and Lewis feedback:

  • need to add correct parsing for reasoning tags provided by the user
  • test on AIME24 instead of 25
# Reasoning (/think) and not removing tokens
```bash
lighteval vllm 'model_name=Qwen/Qwen3-0.6B,data_parallel_size=4,use_chat_template=True,max_model_length=38912,max_num_batched_tokens=100000,generation_parameters={temperature:0.6,top_p:0.95,top_k:20,min_p:0,presence_penalty:1,max_new_tokens:38912},system_prompt=/think' 'lighteval|aime24_avg|0|0,extended|ifeval|0|0' --save-details --output-dir /fsx/clementine/lighteval/_logs/slurm
```

## 0.6B
|         Task         |Version|        Metric         |Value |   |Stderr|
|----------------------|-------|-----------------------|-----:|---|-----:|
|extended:ifeval:0     |       |prompt_level_strict_acc|0.3290|±  |0.0202| 
|lighteval:aime24_avg:0|       |math_avg@64            |0.1000|±  |0.0343|

## 1.7B
|         Task         |Version|        Metric         |Value |   |Stderr|
|----------------------|-------|-----------------------|-----:|---|-----:|
|extended:ifeval:0     |       |prompt_level_strict_acc|0.3530|±  |0.0206|
|lighteval:aime24_avg:0|       |math_avg@64            |0.4052|±  |0.0625|
# Reasoning (/think) and removing tokens
```bash
lighteval vllm 'model_name=Qwen/Qwen3-0.6B,data_parallel_size=8,use_chat_template=True,max_num_batched_tokens=100000,generation_parameters={temperature:0.6,top_p:0.95,top_k:20,min_p:0,presence_penalty:1,max_new_tokens:38912},system_prompt=/think' 'lighteval|aime24_avg|0|0,extended|ifeval|0|0' --remove-reasoning-tags --save-details --output-dir /fsx/clementine/lighteval/_logs/slurm
```

## 0.6B
|         Task         |Version|        Metric         |Value |   |Stderr|  vs  |
|----------------------|-------|-----------------------|-----:|---|-----:|-----:|
|extended:ifeval:0     |       |prompt_level_strict_acc|0.5860|±  |0.0212|  59.2|
|lighteval:aime24_avg:0|       |math_avg@64            |0.1104|±  |0.0369|  10.7|

## 1.7B
|         Task         |Version|        Metric         |Value |   |Stderr|  vs  |
|----------------------|-------|-----------------------|-----:|---|-----:|-----:|
|extended:ifeval:0     |       |prompt_level_strict_acc|0.6839|±  |0.0200|  72.5|
|lighteval:aime24_avg:0|       |math_avg@64            |0.4130|±  |0.0624|  48.3|
# No reasoning
```bash
lighteval vllm 'model_name=Qwen/Qwen3-0.6B,data_parallel_size=8,use_chat_template=True,max_num_batched_tokens=100000,generation_parameters={temperature:0.7,top_p:0.80,top_k:20,min_p:0,presence_penalty:1.5,max_new_tokens:32768},system_prompt=/nothink' 'lighteval|aime24_avg|0|0,extended|ifeval|0|0' --save-details --output-dir /fsx/clementine/lighteval/_logs/slurm
```

## 0.6B
|         Task         |Version|        Metric         |Value |   |Stderr|  vs  |
|----------------------|-------|-----------------------|-----:|---|-----:|-----:|
|extended:ifeval:0     |       |prompt_level_strict_acc|0.3235|±  |0.0201|  54.5|
|lighteval:aime24_avg:0|       |math_avg@64            |0.1073|±  |0.0382|   3.4|

  • test on IFEVAL
  • add an integration test at the pipeline/main level
  • add docs with magistral example

@HuggingFaceDocBuilderDev (Collaborator) commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@clefourrier requested a review from lewtun on July 25, 2025 at 13:14
```python
HELP_PANNEL_NAME_2 = "Logging Parameters"
HELP_PANNEL_NAME_3 = "Debug Parameters"
HELP_PANNEL_NAME_4 = "Modeling Parameters"
HELP_PANEL_NAME_1 = "Common Parameters"
```
@clefourrier (Member, Author):

nit, unrelated to the PR

@lewtun (Member) left a comment

Thanks a lot for adding this feature so quickly! Overall the logic looks sound, but to be sure could we:

  • Evaluate a few popular reasoning models before/after the change on e.g. AIME24 & IFEval?
  • Add some unit tests to check that if --remove-reasoning-tags is set to True/False then the desired post-processing is applied?
  • Add some small docs / an example somewhere explaining this flag? If you want an example of a reasoning model with different think tags, check out Magistral

If possible, it would also be great if we could store both the raw and post-processed predictions in the details. This would be helpful for debugging / understanding whether a poor score is due to an unterminated reasoning block.
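To make the second bullet concrete, here is a hedged sketch of such a unit test (pytest-style; `remove_reasoning_tags` is the hypothetical helper sketched earlier in this thread, inlined so the test is self-contained, not lighteval's actual API):

```python
import re
import pytest

def remove_reasoning_tags(text, pairs=(("<think>", "</think>"),)):
    # Hypothetical helper (same sketch as above), inlined for self-containment.
    for start, end in pairs:
        text = re.sub(re.escape(start) + r".*?" + re.escape(end), "", text, flags=re.DOTALL)
    return text.strip()

@pytest.mark.parametrize(
    "raw, expected",
    [
        # With the flag on, a complete reasoning block is stripped before scoring.
        ("<think>chain of thought</think>The answer is 42.", "The answer is 42."),
        # Outputs where the tag never appears pass through unchanged.
        ("The answer is 42.", "The answer is 42."),
    ],
)
def test_remove_reasoning_tags(raw, expected):
    assert remove_reasoning_tags(raw) == expected
```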

@clefourrier (Member, Author) commented

Ok thanks for the feature list (and yep def for tests at the main level!)
Will do this first thing Monday

@lewtun (Member) commented Jul 28, 2025

For posterity, could you also share the command you're using to reproduce the AIME scores in the PR description?

Note that Qwen3 uses different sampling parameters for the /think and /no_think modes (the full set is described in #872)

@clefourrier marked this pull request as draft on July 29, 2025 at 07:00
@clefourrier (Member, Author) commented

Updating everything to make sure I'm following your args then, will also update the table

```python
gold_index (list): Indices of the gold targets among the [`choices`]
metrics (dict): Metric name to current example score

doc (Doc): The [`Doc`] object containing the current example information.
```
@clefourrier (Member, Author):

Unrelated to the PR; an incorrect docstring was updated

@clefourrier (Member, Author) commented Jul 31, 2025

@lewtun Ok, added tests, updated doc, added better management of tokens (better checks to make sure they are valid).
I feel like the scores I'm getting for Qwen3 (reasoning, with tokens removed) are not too far off the reference numbers.
Wdyt? :)
Feel free to tell me if you want anything else!

I'm getting a mismatch for Qwen3 with no reasoning, but it's unrelated to the thinking-tokens PR, so I'm unsure if I should take it into account (plus the eval is @64, so it's terribly slow: ~5h for a 2B run with vLLM, DP8).
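As an aside, the "better checks to make sure they are valid" mentioned above could look something like the following hedged sketch (function name and error messages are hypothetical, not what the PR actually ships):

```python
def validate_reasoning_tag_pairs(pairs):
    """Sanity-check user-provided (start, end) reasoning tag pairs.

    Hypothetical sketch: verifies shape, non-empty tags, and distinct
    start/end markers so that stripping cannot match degenerate spans.
    """
    for pair in pairs:
        if not (isinstance(pair, (tuple, list)) and len(pair) == 2):
            raise ValueError(f"Expected a (start, end) pair, got {pair!r}")
        start, end = pair
        if not start or not end:
            raise ValueError(f"Empty tag in pair {pair!r}")
        if start == end:
            raise ValueError(f"Start and end tags must differ: {pair!r}")
    return pairs
```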

@clefourrier marked this pull request as ready for review on August 1, 2025 at 10:08
@clefourrier (Member, Author) commented

@lewtun merging cause I need it for the current big eval, feel free to add comments here later

@clefourrier merged commit d7beacb into main on Aug 4, 2025
5 checks passed
Linked issue: [FT] Remove thinking from all evals (#869)