Add TRL GRPO Reasoning with Advanced Reward notebook #319

behroozazarkhalili · 2025-07-26T17:23:27Z

Advanced GRPO Fine-tuning for Mathematical Reasoning with Multi-Reward Training

This PR adds a comprehensive notebook demonstrating advanced GRPO (Group Relative Policy Optimization) for mathematical reasoning tasks using a sophisticated multi-reward training system.

Key Features

🧠 Advanced Training Approach

4 Specialized Reward Functions: Format compliance, approximate matching, answer correctness, and number extraction
Multi-Reward System: Comprehensive evaluation of different aspects of mathematical reasoning
Structured Output: Enforces step-by-step reasoning format with clear solution sections
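
As a rough illustration of how such reward functions might look, the sketch below scores format compliance and answer correctness. The `<reasoning>`/`<solution>` tag names, the reward magnitudes, and the `answer` column name are assumptions made for this example, not values taken from the notebook. It also assumes completions arrive as plain strings (TRL passes lists of message dicts for conversational datasets) and that extra dataset columns reach the function as keyword arguments.

```python
import re

# Hypothetical tag names; the notebook's actual markers may differ.
FORMAT_RE = re.compile(
    r"^\s*<reasoning>.+?</reasoning>\s*<solution>(.+?)</solution>\s*$",
    re.DOTALL,
)
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def extract_solution(completion):
    """Return the text inside the solution section, or None if the format is off."""
    match = FORMAT_RE.match(completion)
    return match.group(1).strip() if match else None


def format_reward(completions, **kwargs):
    """1.0 for a completion that follows the full format, else 0.0."""
    return [1.0 if extract_solution(c) is not None else 0.0 for c in completions]


def correctness_reward(completions, answer, **kwargs):
    """2.0 when the first number in the solution section equals the reference."""
    rewards = []
    for completion, reference in zip(completions, answer):
        solution = extract_solution(completion) or ""
        found = NUMBER_RE.search(solution.replace(",", ""))
        try:
            ok = found is not None and float(found.group()) == float(reference)
        except ValueError:
            ok = False
        rewards.append(2.0 if ok else 0.0)
    return rewards
```

Keeping each criterion in its own function means the per-reward scores can be logged separately, which makes it easier to see whether the model is learning the format, the arithmetic, or both.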

⚡ Memory-Efficient Implementation

4-bit Quantization: ~75% smaller weight memory than 16-bit precision, using BitsAndBytesConfig
LoRA Fine-tuning: Train only ~0.1% of parameters while maintaining performance
Consumer GPU Friendly: Optimized for single GPU training with gradient accumulation
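
The memory and parameter figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only an estimate under stated assumptions: it counts weight storage alone (no activations, optimizer state, or quantization constants), treats "3B" as a round 3e9 parameters, and uses an illustrative 2048 x 2048 layer with rank 8 that is not taken from the notebook.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_in x d_out weight matrix."""
    return rank * (d_in + d_out)


num_params = 3e9  # nominal size of a "3B" model, used as a round number
fp16_gb = weight_memory_gb(num_params, 16)
int4_gb = weight_memory_gb(num_params, 4)
reduction = 1 - int4_gb / fp16_gb

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")
print(f"reduction:      {reduction:.0%}")

# Illustrative layer shape and rank, not taken from the notebook.
full = 2048 * 2048
adapter = lora_params(2048, 2048, rank=8)
print(f"LoRA adds {adapter} trainable parameters vs {full} frozen ({adapter / full:.2%})")
```

The overall trainable fraction of a real model depends on which modules receive adapters and on the chosen rank, so the per-layer ratio here is not directly comparable to the ~0.1% figure quoted above.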

📊 Comprehensive Experiment Tracking

Trackio Integration: Real-time training metrics logging and visualization
Interactive Dashboard: Monitor reward scores, KL divergence, policy gradients, and completion statistics
Training Analytics: Track format compliance, mathematical accuracy, and model behavior

🎯 Production-Ready Features

GSM8K Dataset: Grade school math problems requiring multi-step reasoning
Qwen2.5-3B-Instruct: Instruction-tuned model optimized for reasoning tasks
Evaluation Framework: Structured output validation and accuracy testing
Resource Management: GPU memory cleanup and experiment organization
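
For the accuracy side of evaluation, a minimal sketch might look like the following. It relies on the GSM8K convention that each reference `answer` field ends with `#### <final answer>`; taking the last number in the model output as its prediction is an assumption made for this example, and the notebook may extract answers differently.

```python
import re


def gsm8k_gold_answer(answer_field):
    """Return the final answer that follows the '####' marker, without commas."""
    return answer_field.split("####")[-1].strip().replace(",", "")


def last_number(text):
    """Return the last number that appears in text, or None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None


def accuracy(predictions, gold_fields):
    """Fraction of predictions whose last number equals the gold answer."""
    correct = 0
    for prediction, gold_field in zip(predictions, gold_fields):
        predicted = last_number(prediction)
        gold = gsm8k_gold_answer(gold_field)
        if predicted is not None and float(predicted) == float(gold):
            correct += 1
    return correct / len(gold_fields) if gold_fields else 0.0
```

Comparing as floats rather than strings avoids penalizing answers such as `18.0` when the reference is `18`.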

Technical Implementation

The notebook demonstrates:

Setting up quantized models with LoRA adapters
Implementing custom reward functions for mathematical reasoning
Configuring GRPO training with memory constraints
Real-time experiment tracking with interactive visualizations
Model evaluation and structured output validation
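
A configuration along these lines could be sketched as below. Every value here (rank, dropout, target modules, batch sizes, learning rate, output directory) is a placeholder chosen for illustration rather than the notebook's actual setting. The library imports are deferred into `build_configs()` so the settings can be read without `torch`, `transformers`, `peft`, or `trl` installed; calling the function requires all four, and `GRPOConfig` may additionally need hardware that supports its default precision settings.

```python
# Plain dictionaries so the settings can be inspected without any GPU library.
QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}
GRPO_KWARGS = {
    "output_dir": "grpo-gsm8k-sketch",  # hypothetical path
    # Kept a multiple of num_generations, a constraint TRL is assumed to enforce.
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "num_generations": 8,
    "max_completion_length": 256,
    "learning_rate": 5e-6,
}


def build_configs():
    """Create the library config objects; imports are deferred to call time."""
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig
    from trl import GRPOConfig

    quant = BitsAndBytesConfig(bnb_4bit_compute_dtype=torch.bfloat16, **QUANT_KWARGS)
    return quant, LoraConfig(**LORA_KWARGS), GRPOConfig(**GRPO_KWARGS)
```

The quantization config would typically be passed as `quantization_config` when loading the base model, and the LoRA config as `peft_config` when constructing the trainer.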

Usage

The notebook is self-contained and includes:

Detailed explanations of each component
Memory-efficient configurations for consumer hardware
Interactive experiment tracking setup
Comprehensive evaluation and testing procedures

This implementation serves as a practical guide for researchers and practitioners working on mathematical reasoning with RLHF techniques.

review-notebook-app · 2025-07-26T17:23:32Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb

sergiopaniego

Thanks for the addition! 😄
We already have a pretty similar example "Post training an LLM for reasoning with GRPO in TRL".
The idea of the repo is to have end-to-end recipes with extended explanations, so I'd suggest:

Extend the explanations throughout the recipe.
Link the previous example and make a clear distinction between them, explaining it at the beginning. Otherwise, it could lead to confusion for a reader looking for an example of GRPO.

The recipes can be opened in Colab and possibly run, so it'd also be nice to keep that in mind. For example, when setting os.environ["CUDA_VISIBLE_DEVICES"] = "1", remember that Colab provides only one GPU.
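
One way to avoid a hard-coded device index is sketched below. The helper names are hypothetical, and the number of available GPUs is passed in by the caller so the logic stays independent of any particular detection method. Note that `CUDA_VISIBLE_DEVICES` generally needs to be set before CUDA is initialized in the process.

```python
import os


def pick_gpu_index(preferred, available_count):
    """Return preferred if that GPU exists, else 0; None when there is no GPU."""
    if available_count <= 0:
        return None
    return preferred if 0 <= preferred < available_count else 0


def configure_visible_gpu(preferred, available_count, env=os.environ):
    """Set CUDA_VISIBLE_DEVICES only when the caller has not set it already."""
    index = pick_gpu_index(preferred, available_count)
    if index is not None and "CUDA_VISIBLE_DEVICES" not in env:
        env["CUDA_VISIBLE_DEVICES"] = str(index)
    return index
```

On a single-GPU runtime such as Colab, `configure_visible_gpu(1, 1)` falls back to index 0 instead of pointing at a device that does not exist.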

notebooks/en/index.md

notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb

sergiopaniego

Could you also resolve the conflicts with main? 😄

This notebook demonstrates how to use TRL (Transformer Reinforcement Learning) with GRPO (Group Relative Policy Optimization) for reasoning tasks with advanced reward mechanisms.

- Added notebook with proper lowercase filename
- Updated _toctree.yml and index.md
- Added proper author attribution
- Cleaned non-informative outputs

Contributed by: Behrooz Azarkhalili

- Remove torch and accelerate from installation (dependencies of TRL)
- Remove pad token check (handled automatically)
- Restore num_generations to default value (8)
- Remove remove_unused_columns parameter (false by default)
- Remove processing_class parameter (loaded automatically)
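
For context on `num_generations`: it sets the group size, that is, how many completions are sampled per prompt so that their rewards can be compared against each other. A simplified sketch of that group-relative normalization follows. It uses the population standard deviation and an arbitrary `eps` for brevity, so it should be read as an illustration of the idea rather than TRL's exact computation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize one prompt's rewards: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight completions for one prompt, each scored by the summed reward functions.
group = [0.0, 0.0, 1.0, 3.0, 3.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(group)
```

Completions that beat their group's average get a positive advantage and the rest a negative one. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.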

…O recipe

- Add direct link to existing HuggingFace GRPO cookbook example
- Fix CUDA device setting for Colab compatibility (auto-detect instead of hardcoded)
- Add comprehensive explanations throughout all recipe sections
- Enhance with detailed comparison table showing differences from basic example
- Improve GPU setup with memory information and fallback instructions
- Add detailed LoRA configuration explanations and parameter analysis
- Expand dataset preparation with GSM8K background and format details
- Detail multi-reward system design for mathematical reasoning approach
- Optimize training configuration with Colab-specific memory settings
- Enhance testing and evaluation with detailed response analysis
- Make notebook fully end-to-end recipe focused for cookbook standards
- Address all reviewer feedback comprehensively for cookbook contribution

…anup

Major improvements to GRPO mathematical reasoning notebook:

Content Organization:
- Streamlined introduction removing verbose explanations
- Simplified installation and setup sections with clear instructions
- Updated all markdown cells to be concise and action-oriented
- Improved inline comments to explain technical decisions and "why" behind code

Technical Enhancements:
- Added trackio experiment tracking with comprehensive configuration
- Implemented timestamp-based unique run naming for session separation
- Enhanced logging configuration to suppress verbose HTTP request logs
- Optimized training parameters for mathematical reasoning tasks
- Improved model evaluation section with structured output validation

Code Quality:
- Clean, consistent formatting across all 38 cells
- Removed decorative print statements and emojis from evaluation section
- Added proper error handling and documentation
- Streamlined resource management and GPU memory optimization

Resource Management:
- Added remove_trackio_project() function for database cleanup
- Comprehensive cleanup section with storage management
- Warning comments about permanent data deletion
- Proper resource freeing with GPU cache clearing

Testing and Validation:
- Enhanced model testing with optimized generation parameters
- Improved format compliance checking with detailed validation
- Better answer accuracy verification with extraction methods
- Comprehensive response analysis and debugging output

This represents the final polished version ready for production use, incorporating all previous feedback and implementing best practices for educational content, technical accuracy, and resource management.
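
The timestamp-based run naming and HTTP log suppression mentioned in this commit could be sketched roughly as follows. The helper names, the name format, and the choice of `httpx` and `urllib3` as the loggers to quiet are assumptions for illustration; the notebook may use different names or loggers.

```python
import logging
from datetime import datetime


def make_run_name(prefix, now=None):
    """Build a run name such as 'grpo-gsm8k-20250826-181400'."""
    now = now or datetime.now()
    return f"{prefix}-{now:%Y%m%d-%H%M%S}"


def quiet_http_logs(logger_names=("httpx", "urllib3")):
    """Raise the level of chatty HTTP client loggers to WARNING."""
    for name in logger_names:
        logging.getLogger(name).setLevel(logging.WARNING)
```

A second-resolution timestamp keeps runs from separate sessions apart in the tracking dashboard without requiring the user to invent a new name each time.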

notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb

sergiopaniego

Just a final comment regarding the removal of Open in Colab button and we're ready!

Integrates trackio experiment tracking to capture and visualize GRPO training metrics including reward scores, KL divergence, policy gradients, and completion statistics. Also removes unnecessary Open In Colab button as it's automatically added by the platform.

behroozazarkhalili · 2025-08-27T14:19:14Z

Just a final comment regarding the removal of Open in Colab button and we're ready!

Resolved. It came back during rebase :)

notebooks/en/index.md

sergiopaniego

thanks!!

HuggingFaceDocBuilderDev · 2025-08-27T14:48:01Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec reviewed Jul 26, 2025

sergiopaniego reviewed Jul 29, 2025

notebooks/en/index.md (resolved)

sergiopaniego reviewed Aug 11, 2025

sergiopaniego mentioned this pull request Aug 11, 2025

Add Function Calling Fine-tuning LLMs on xLAM Dataset notebook #321

behroozazarkhalili added 4 commits August 23, 2025 18:38

behroozazarkhalili force-pushed the add-grpo-advanced-reward-notebook branch from e6e5cbb to 72a5d43 Compare August 24, 2025 01:39

sergiopaniego reviewed Aug 26, 2025

notebooks/en/trl_grpo_reasoning_advanced_reward.ipynb (3 resolved comments)

behroozazarkhalili force-pushed the add-grpo-advanced-reward-notebook branch 2 times, most recently from aa05321 to 93bcc03 Compare August 26, 2025 18:14

sergiopaniego reviewed Aug 27, 2025

behroozazarkhalili force-pushed the add-grpo-advanced-reward-notebook branch from 93bcc03 to 1051b4c Compare August 27, 2025 14:18

sergiopaniego reviewed Aug 27, 2025

notebooks/en/index.md (outdated, resolved)

Update notebooks/en/index.md

b711c5a

sergiopaniego approved these changes Aug 27, 2025

sergiopaniego merged commit 152ea5c into huggingface:main Aug 27, 2025
1 check passed

behroozazarkhalili deleted the add-grpo-advanced-reward-notebook branch August 27, 2025 15:06
