New medium track WR: 1404s. Snoo Optimizer. Includes #124 and #119 #128
Conversation
This is cool. I know little about optimizers. Looking at the code, I'm reading this as: every 28 steps, compute the distance traveled over those steps, undo it, and then move in that direction more smoothly with Nesterov momentum SGD. The p-value is not below the 0.01 requirement, though. Adding back some of the 60 steps might be worthwhile here; otherwise the challenge ends up in a state where nobody can contribute without burning money on 80+ runs, because the record is so close to the cutoff that it's impossible to get p<0.01. It's hard for me to tell exactly how much of an improvement this is, since the prior record had a mean of 2.9191 and this one has a mean of 2.919656. I'd estimate there is some improvement, but not quite 60 steps' worth.
Ok yea that's fair. I reran with 5640 steps and now the p-value is below the 0.01 requirement.
Yes, basically.
`df_nanogpt_med_5640['train_time'].mean()=np.float64(1404029.9333333333)`
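The p<0.01 acceptance check discussed above compares two sets of runs (old record vs. new record) and asks whether the new mean is significantly lower. The repo's actual criterion may be defined differently; the following is just a sketch of a one-sided Welch's t-test using stdlib Python, with a normal approximation for the p-value (reasonable at ~100 runs) and hypothetical per-run values:

```python
import math
import random
import statistics

random.seed(0)
# Hypothetical per-run final val losses for the old record and the candidate
old_runs = [random.gauss(2.9196, 0.0008) for _ in range(100)]
new_runs = [random.gauss(2.9190, 0.0008) for _ in range(100)]

def welch_t(a, b):
    """Welch's t statistic for mean(a) - mean(b) with unequal variances."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

t = welch_t(new_runs, old_runs)
# One-sided p-value (alternative: new mean is lower), normal approximation
p = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
print(f"t={t:.2f}  p={p:.4f}")
```

With only ~30 runs you would want the exact Student-t tail (e.g. `scipy.stats.ttest_ind(..., equal_var=False, alternative="less")`) rather than the normal approximation, which is part of why a record sitting very close to the cutoff forces so many runs.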
Awesome! 50 steps is big.
At one point I tried testing with this because I figured it would be way easier to assess changes, but unfortunately the runtime was substantially worse and the loss curve followed a different trajectory, iirc. That made it so a change could be good under deterministic algos but bad under stochastic. As a result, I never went back to testing with deterministic algos. I think the issue was primarily due to the bfloat16 params and fp8 lm_head on the short track. In addition, I think deterministic algos create a risk that people will curve-fit every parameter to the validation set. There is already some amount of curve fitting, but stochastic algos help mitigate that.
Summary
This PR builds on PR #124 and adds the Snoo optimizer (Sparse Nesterov Outer Optimizer), which improves the medium-track WR by 60 steps (~10s).
Snoo is a look-ahead, momentum-based wrapper that can improve the quality of large language models (LLMs) and other models. Snoo implicitly smooths the training trajectory and instills a bias toward flatter minima. It is computationally efficient, incurring minimal compute overhead and moderate memory usage.
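Based on the reading in the discussion above (every k steps: measure the displacement, rewind, then re-apply it through a smoother Nesterov-momentum step), here is a minimal numpy sketch of such a look-ahead outer optimizer, in the spirit of Lookahead/SlowMo. The class name, k=28, and hyperparameters are illustrative assumptions, not the PR's actual implementation:

```python
import numpy as np

class SnooSketch:
    """Sketch of a look-ahead outer optimizer (hypothetical, not the PR's code).

    Every k inner steps: measure the displacement since the last anchor,
    rewind to the anchor, then take a Nesterov-momentum step in the
    displacement direction. Hyperparameters are illustrative.
    """

    def __init__(self, params, k=28, outer_lr=1.0, momentum=0.9):
        self.params = params              # updated in place by inner steps
        self.anchor = params.copy()       # where the last outer step landed
        self.buf = np.zeros_like(params)  # outer momentum buffer
        self.k, self.outer_lr, self.momentum = k, outer_lr, momentum
        self.t = 0

    def inner_step(self, grad, lr=0.1):
        self.params -= lr * grad          # ordinary inner SGD step
        self.t += 1
        if self.t % self.k == 0:
            self._outer_step()

    def _outer_step(self):
        delta = self.params - self.anchor  # distance traveled over k steps
        self.buf = self.momentum * self.buf + delta
        # Nesterov-style: step by the fresh delta plus the look-ahead momentum
        self.anchor += self.outer_lr * (delta + self.momentum * self.buf)
        self.params[...] = self.anchor     # undo the raw trajectory

# Usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself
x = np.array([1.0, -2.0])
opt = SnooSketch(x)
for _ in range(280):                       # 10 outer cycles of k=28 inner steps
    opt.inner_step(x)
print(x)                                   # close to the minimum at 0
```

The smoothing effect comes from the outer step seeing only the net displacement over k steps, so high-frequency noise in the inner trajectory is averaged out before the momentum buffer accumulates it.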
@dominikkallusky, @vishal9-team, @vinaysrao
Code
Stats:
Count: 42
Train Time:
Val Loss: