
Conversation

trianxy commented Mar 13, 2025

This PR adds optional validation using HellaSwag accuracy (on 10042 validation examples).

It is controlled by setting the hyperparameter hellaswag=True.

Motivation
I am trying to make the model smarter by adjusting the architecture/training objective, and I wanted a benchmark like HellaSwag, which may correlate better with model capability than loss does. By the way: having this repo, which makes a model converge quickly, is super helpful!

Not sure if it's worth merging into master. Some thoughts:

  • Pros:
    • more data/confidence on whether a change improved the model
    • relatedly, it improves comparability with other models and training runs
  • Cons:
    • adds ~150 lines of code
    • adds ~40s wall time to the script when running on 8 x H100:
      • ~30s to download data + preprocess + warm-up
      • ~6s to run validations (~0.4s per validation pass over the 10042 examples: the code packs 400-800 HellaSwag tasks into one sequence, which is then evaluated with a single forward pass; each of the 8 GPUs needs 2 forward passes to cover all examples; see the sketch after this list)
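For reference, here is a minimal sketch of the packing idea, assuming a causal LM `model` that returns logits of shape `(1, T, vocab)` and a `tokenizer` with `.encode()`; both names are placeholders, not the PR's actual code. A real implementation would also need to stop candidates from attending across task boundaries (e.g. via document separators or a block-diagonal attention mask), which this sketch omits for brevity.

```python
# Minimal sketch of packing many HellaSwag candidates into one sequence
# and scoring them with a single forward pass (placeholder names, not the
# PR's actual code; cross-candidate attention masking is omitted).
import torch
import torch.nn.functional as F

@torch.no_grad()
def packed_hellaswag_accuracy(model, tokenizer, tasks, device="cuda"):
    # tasks: list of (context_str, [4 ending_strs], correct_label)
    tokens, segments = [], []  # segments: (start, end, n_ctx, task_idx, ending_idx)
    for t_idx, (ctx, endings, _label) in enumerate(tasks):
        ctx_tok = tokenizer.encode(ctx)
        for e_idx, ending in enumerate(endings):
            end_tok = tokenizer.encode(" " + ending)
            start = len(tokens)
            tokens.extend(ctx_tok + end_tok)
            segments.append((start, len(tokens), len(ctx_tok), t_idx, e_idx))

    x = torch.tensor(tokens, device=device).unsqueeze(0)
    logits = model(x)                                    # (1, T, vocab), assumed interface
    logprobs = F.log_softmax(logits[0].float(), dim=-1)

    scores = torch.full((len(tasks), 4), float("-inf"), device=device)
    for start, end, n_ctx, t_idx, e_idx in segments:
        # the token at position i is predicted by the logits at position i-1
        idx = torch.arange(start + n_ctx, end, device=device)
        tok_lp = logprobs[idx - 1].gather(1, x[0, idx, None]).squeeze(1)
        scores[t_idx, e_idx] = tok_lp.mean()             # mean log-likelihood of the ending

    preds = scores.argmax(dim=1)
    labels = torch.tensor([label for _, _, label in tasks], device=device)
    return (preds == labels).float().mean().item()
```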

Happy to adjust the PR if helpful.

trianxy (Author) commented May 5, 2025

Hey @KellerJordan, checking in to see if there's anything I can do to make this PR easier to review or more valuable.

Also, feel free to close it if you feel it's out of scope.

KellerJordan (Owner) commented:
Hi @trianxy, thanks for the PR. I think adding code to easily get the HellaSwag score is a good idea; however, I want to keep it out of train_gpt.py to avoid complexity and wallclock increase. For me to accept a PR adding HellaSwag, it would probably need to instead add a hellaswag.py file into data/ which computes the score by consuming a checkpoint generated by setting save_checkpoint=True in the Args in train_gpt.py.
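Roughly, such a script could take the following shape. This is a hedged sketch only: the checkpoint keys, the model constructor, the `load_hellaswag_data` helper, and the import from train_gpt.py are assumptions, not the repo's actual interface.

```python
# data/hellaswag.py -- rough sketch; checkpoint format, constructor and
# import names are assumptions, not the repo's actual API.
import argparse
import torch

from train_gpt import GPT  # hypothetical import of the model class


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--checkpoint", required=True,
                        help="checkpoint written by train_gpt.py with save_checkpoint=True")
    args = parser.parse_args()

    ckpt = torch.load(args.checkpoint, map_location="cuda")
    model = GPT()                          # hypothetical constructor / config
    model.load_state_dict(ckpt["model"])   # assumed checkpoint key
    model.cuda().eval()

    tokenizer, tasks = load_hellaswag_data()  # hypothetical helper: downloads + tokenizes the val set
    acc = packed_hellaswag_accuracy(model, tokenizer, tasks)  # e.g. the packed evaluation sketched earlier
    print(f"HellaSwag accuracy: {acc:.4f}")


if __name__ == "__main__":
    main()
```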

trianxy (Author) commented May 28, 2025

Sounds good @KellerJordan - I'll take a stab at adjusting this PR accordingly

trianxy (Author) commented Aug 12, 2025

Hey @gszfwsb, thanks for the approval. Before we merge, let me implement what @KellerJordan mentioned in #89 (comment) and also resolve the conflicts.

Haven't gotten around to it yet, but am looking to do it in September.

After that, I'd ask you to take one more look.

trianxy force-pushed the hellaswag branch 2 times, most recently from 9b8fb2c to 473800c on August 27, 2025 10:25
trianxy (Author) commented Aug 27, 2025

Hey @gszfwsb, I just updated this PR to incorporate @KellerJordan's suggestion to move the HellaSwag code into a separate file data/hellaswag.py.

To both of you: Feel free to let me know if there's anything else I can do on my side.

FYI

I didn't change how train_gpt.py works, but I had to move a chunk of code behind an if __name__ == "__main__": guard, because I wanted to import the model from that file without re-running the training loop.
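Schematically, the restructuring looks like this (a simplified sketch, not the actual contents of train_gpt.py):

```python
# train_gpt.py (schematic): definitions stay at module level, so
# `from train_gpt import GPT` has no side effects; training only runs
# when the file is executed directly.
import torch.nn as nn


class GPT(nn.Module):
    # placeholder model definition, importable from other files
    def __init__(self):
        super().__init__()
        self.lm_head = nn.Linear(768, 50304, bias=False)


def train():
    # the existing training loop, moved behind the guard (schematic)
    pass


if __name__ == "__main__":
    train()  # not executed on `import train_gpt`
```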

There would have been a few options to implement this PR without touching train_gpt.py, e.g.

  • Option 1: read the source code of train_gpt.py into a string, strip the parts that aren't needed, and then exec(...) it
  • Option 2: use the ast module to load from the source code only the objects that are necessary to define the model architecture

I decided against these, because they felt less maintainable and more likely to break in the future.
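For concreteness, a rough sketch of what Option 2 might have looked like (illustrative only; it was not implemented, and the function name is made up):

```python
# Rough sketch of the rejected Option 2: use `ast` to keep only imports,
# class and function definitions from train_gpt.py, then exec the reduced
# module to obtain the model classes without running the training loop.
import ast


def load_model_definitions(path="train_gpt.py"):
    with open(path) as f:
        tree = ast.parse(f.read())
    kept = [node for node in tree.body
            if isinstance(node, (ast.Import, ast.ImportFrom,
                                 ast.ClassDef, ast.FunctionDef))]
    reduced = ast.Module(body=kept, type_ignores=[])
    namespace = {}
    exec(compile(reduced, path, "exec"), namespace)
    return namespace  # e.g. namespace["GPT"] would be the model class

# Fragile: module-level constants (and anything the class bodies reference
# at definition time) would be silently dropped, which is part of why this
# felt more likely to break in the future.
```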

trianxy requested a review from gszfwsb on August 28, 2025 10:47