
Added post processing (for reasoning tokens) to pipeline #882


Merged: 19 commits into main, Aug 4, 2025

Conversation

@clefourrier (Member) commented Jul 25, 2025

Fix #869
Tested on Qwen3-0.6B and AIME25: when you select --remove-reasoning-tags, it indeed removes them before sending the answer to the metric (but the model is quite verbose, so many outputs are unchanged because the token never appears).
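For context on what the flag does: conceptually, the post-processing just strips tag-delimited reasoning spans from the generation before it reaches the metric. Here is a minimal sketch of the idea (the helper name, signature, and default tags are illustrative, not lighteval's actual API; Magistral-style tags are noted as an alternative pair):

```python
import re

def remove_reasoning_tags(text: str, pairs=(("<think>", "</think>"),)) -> str:
    """Strip reasoning blocks delimited by the given (start, end) tag pairs.

    Hypothetical sketch: tag pairs are configurable because models differ,
    e.g. Qwen3 emits <think>...</think> while Magistral uses [THINK]...[/THINK].
    """
    for start, end in pairs:
        # Drop each complete start...end block, tags included; non-greedy so
        # multiple blocks in one output are handled independently.
        pattern = re.escape(start) + r".*?" + re.escape(end)
        text = re.sub(pattern, "", text, flags=re.DOTALL)
    return text.strip()
```

With something like this in place, an output such as `<think>...</think>The answer is 42.` is scored on `The answer is 42.` alone, while verbose outputs that never emit the closing tag pass through unchanged, which matches the behaviour described above.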

Edit: updated todos based on observation and Lewis feedback:

  • need to add correct parsing for reasoning tags provided by the user
  • test on AIME24 instead of 25
# Reasoning (/think) and not removing tokens
```bash
lighteval vllm 'model_name=Qwen/Qwen3-0.6B,data_parallel_size=4,use_chat_template=True,max_model_length=38912,max_num_batched_tokens=100000,generation_parameters={temperature:0.6,top_p:0.95,top_k:20,min_p:0,presence_penalty:1,max_new_tokens:38912},system_prompt=/think' 'lighteval|aime24_avg|0|0,extended|ifeval|0|0' --save-details --output-dir /fsx/clementine/lighteval/_logs/slurm
```

## 0.6B
|         Task         |Version|        Metric         |Value |   |Stderr|
|----------------------|-------|-----------------------|-----:|---|-----:|
|extended:ifeval:0     |       |prompt_level_strict_acc|0.3290|±  |0.0202| 
|lighteval:aime24_avg:0|       |math_avg@64            |0.1000|±  |0.0343|

## 1.7B
|         Task         |Version|        Metric         |Value |   |Stderr|
|----------------------|-------|-----------------------|-----:|---|-----:|
|extended:ifeval:0     |       |prompt_level_strict_acc|0.3530|±  |0.0206|
|lighteval:aime24_avg:0|       |math_avg@64            |0.4052|±  |0.0625|
# Reasoning (/think) and removing tokens
```bash
lighteval vllm 'model_name=Qwen/Qwen3-0.6B,data_parallel_size=8,use_chat_template=True,max_num_batched_tokens=100000,generation_parameters={temperature:0.6,top_p:0.95,top_k:20,min_p:0,presence_penalty:1,max_new_tokens:38912},system_prompt=/think' 'lighteval|aime24_avg|0|0,extended|ifeval|0|0' --remove-reasoning-tags --save-details --output-dir /fsx/clementine/lighteval/_logs/slurm
```

## 0.6B
|         Task         |Version|        Metric         |Value |   |Stderr|  vs  |
|----------------------|-------|-----------------------|-----:|---|-----:|-----:|
|extended:ifeval:0     |       |prompt_level_strict_acc|0.5860|±  |0.0212|  59.2|
|lighteval:aime24_avg:0|       |math_avg@64            |0.1104|±  |0.0369|  10.7|

## 1.7B
|         Task         |Version|        Metric         |Value |   |Stderr|  vs  |
|----------------------|-------|-----------------------|-----:|---|-----:|-----:|
|extended:ifeval:0     |       |prompt_level_strict_acc|0.6839|±  |0.0200|  72.5|
|lighteval:aime24_avg:0|       |math_avg@64            |0.4130|±  |0.0624|  48.3|
# No reasoning
```bash
lighteval vllm 'model_name=Qwen/Qwen3-0.6B,data_parallel_size=8,use_chat_template=True,max_num_batched_tokens=100000,generation_parameters={temperature:0.7,top_p:0.80,top_k:20,min_p:0,presence_penalty:1.5,max_new_tokens:32768},system_prompt=/nothink' 'lighteval|aime24_avg|0|0,extended|ifeval|0|0' --save-details --output-dir /fsx/clementine/lighteval/_logs/slurm
```

## 0.6B
|         Task         |Version|        Metric         |Value |   |Stderr|  vs  |
|----------------------|-------|-----------------------|-----:|---|-----:|-----:|
|extended:ifeval:0     |       |prompt_level_strict_acc|0.3235|±  |0.0201|  54.5|
|lighteval:aime24_avg:0|       |math_avg@64            |0.1073|±  |0.0382|   3.4|

  • test on IFEVAL
  • add an integration test at the pipeline/main level
  • add docs with magistral example

@HuggingFaceDocBuilderDev (Collaborator) commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@clefourrier requested a review from lewtun on July 25, 2025 at 13:14
```python
HELP_PANNEL_NAME_2 = "Logging Parameters"
HELP_PANNEL_NAME_3 = "Debug Parameters"
HELP_PANNEL_NAME_4 = "Modeling Parameters"
HELP_PANEL_NAME_1 = "Common Parameters"
```
@clefourrier (Member, Author):

nit, unrelated to the PR

@lewtun (Member) left a comment

Thanks a lot for adding this feature so quickly! Overall the logic looks sound, but to be sure could we:

  • Evaluate a few popular reasoning models before/after the change on e.g. AIME24 & IFEval?
  • Add some unit tests to check that if --remove-reasoning-tags is set to True/False then the desired post-processing is applied?
  • Add some small docs / an example somewhere explaining this flag? If you want an example of a reasoning model with different think tags, check out Magistral

If possible, it would also be great if we could store both the raw and post-processed predictions in the details. This would be helpful for debugging / understanding whether a poor score is due to an unterminated reasoning block.
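To make the second bullet concrete, here is a hedged sketch of such a unit test (pytest-style; `remove_reasoning_tags` is the hypothetical helper sketched earlier in this thread, inlined so the test is self-contained, not lighteval's actual API):

```python
import re
import pytest

def remove_reasoning_tags(text, pairs=(("<think>", "</think>"),)):
    # Hypothetical helper (same sketch as above), inlined for self-containment.
    for start, end in pairs:
        text = re.sub(re.escape(start) + r".*?" + re.escape(end), "", text, flags=re.DOTALL)
    return text.strip()

@pytest.mark.parametrize(
    "raw, expected",
    [
        # With the flag on, a complete reasoning block is stripped before scoring.
        ("<think>chain of thought</think>The answer is 42.", "The answer is 42."),
        # Outputs where the tag never appears pass through unchanged.
        ("The answer is 42.", "The answer is 42."),
    ],
)
def test_remove_reasoning_tags(raw, expected):
    assert remove_reasoning_tags(raw) == expected
```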

@clefourrier (Member, Author) commented

Ok thanks for the feature list (and yep def for tests at the main level!)
Will do this first thing Monday

@lewtun (Member) commented Jul 28, 2025

For posterity, could you also share the command you're using to reproduce the AIME scores in the PR description?

Note that Qwen3 uses different sampling parameters for the /think and /no_think modes (the full set is described in #872)

@clefourrier marked this pull request as draft on July 29, 2025 at 07:00
@clefourrier (Member, Author) commented

Updating everything to make sure I'm following your args then, will also update the table

```python
gold_index (list): Indices of the gold targets among the [`choices`]
metrics (dict): Metric name to current example score

doc (Doc): The [`Doc`] object containing the current example information.
```
@clefourrier (Member, Author):

Unrelated to the PR; an incorrect docstring was updated

@clefourrier (Member, Author) commented Jul 31, 2025

@lewtun Ok, added tests, updated doc, added better management of tokens (better checks to make sure they are valid).
I feel like the scores I'm getting for Qwen3 (reasoning, with tokens removed) are not too far off the reference numbers.
Wdyt? :)
Feel free to tell me if you want anything else!

I'm getting a mismatch for Qwen3 with no reasoning, but it's unrelated to the thinking-tokens PR, so I'm unsure if I should take it into account (plus the eval is @64, so it's terribly slow: ~5h for a 2B run with vLLM, DP8).
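As an aside, the "better checks to make sure they are valid" mentioned above could look something like the following hedged sketch (function name and error messages are hypothetical, not what the PR actually ships):

```python
def validate_reasoning_tag_pairs(pairs):
    """Sanity-check user-provided (start, end) reasoning tag pairs.

    Hypothetical sketch: verifies shape, non-empty tags, and distinct
    start/end markers so that stripping cannot match degenerate spans.
    """
    for pair in pairs:
        if not (isinstance(pair, (tuple, list)) and len(pair) == 2):
            raise ValueError(f"Expected a (start, end) pair, got {pair!r}")
        start, end = pair
        if not start or not end:
            raise ValueError(f"Empty tag in pair {pair!r}")
        if start == end:
            raise ValueError(f"Start and end tags must differ: {pair!r}")
    return pairs
```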

@clefourrier marked this pull request as ready for review on August 1, 2025 at 10:08
@clefourrier (Member, Author) commented

@lewtun merging cause I need it for the current big eval, feel free to add comments here later

@clefourrier merged commit d7beacb into main on Aug 4, 2025
5 checks passed
Linked issue: [FT] Remove thinking from all evals (#869)