"Save best" and other improvements to model checkpointing, Issue #307#334
Merged
shanjiaz merged 8 commits intoMar 20, 2026
Merged
Conversation
Collaborator
|
I think some IDE files were also added as a part of this diff, could you remove them? |
Contributor
Author
@rahul-tuli thanks for catching that. I have removed them. Let me know if any further modifications are required. |
rahul-tuli
reviewed
Mar 5, 2026
rahul-tuli
left a comment
Collaborator
There was a problem hiding this comment.
LGTM, pending comments, great first pass!
shanjiaz
reviewed
Mar 5, 2026
|
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: surojitiitg@gmail.com <surojitiitg@gmail.com>
Signed-off-by: surojitiitg@gmail.com <surojitiitg@gmail.com>
d1820ef to
988cbc2
Compare
Signed-off-by: surojitiitg@gmail.com <surojitiitg@gmail.com>
988cbc2 to
c9aebc1
Compare
fynnsu
reviewed
Mar 16, 2026
545503e to
f4024ea
Compare
Collaborator
|
@surojitiitg Please fix quality issues. |
Signed-off-by: surojitiitg@gmail.com <surojitiitg@gmail.com>
f4024ea to
d5e0dd7
Compare
Signed-off-by: surojitiitg@gmail.com <surojitiitg@gmail.com>
8559129 to
7433ad5
Compare
fynnsu
approved these changes
Mar 19, 2026
fynnsu
left a comment
Collaborator
There was a problem hiding this comment.
Looks good! Thank you for making all the updates.
shanjiaz
approved these changes
Mar 20, 2026
YzTongNiar
pushed a commit
to YzTongNiar/speculators
that referenced
this pull request
Apr 10, 2026
…-project#307 (vllm-project#334) ## Summary This PR improves checkpoint saving behavior during training by adding configurable checkpoint frequency and stable best-checkpoint tracking. ## Changes - Files modified: `src/speculators/train/trainer.py`, `src/speculators/train/checkpointer.py`, `scripts/train.py` - Files added: `tests/unit/train/test_checkpoint.py` - Added `--checkpoint-freq` to control how often numbered checkpoints are saved - Added `--save-best` to maintain a stable `checkpoint_best/` path - Implemented symlink support so `checkpoint_best/` points to the best saved checkpoint directory - Kept default behavior backward compatible by using `checkpoint_freq=1` - Updated validation flow in `src/speculators/train/trainer.py` so that `val_epoch(...)` returns the computed `val_metrics` dictionary to the training loop - Best-checkpoint selection uses the key `loss_epoch` from the dictionary returned by `self.val_epoch(epoch)` as the monitored metric. The implementation is carried out in `src/speculators/train/trainer.py` - Added `tests/unit/train/test_checkpoint.py`, which contains unit tests for: - ignoring `checkpoint_best` during resume checkpoint discovery - creating and updating the best-checkpoint symlink - checkpoint selection behavior during training ## Behavior - If `--checkpoint-freq` and `--save-best` are not provided, the current checkpointing behavior is preserved (backward compatible defaults). - `checkpoint_freq=1` preserves the current behavior of saving checkpoints every epoch - When `save_best=True`, `checkpoint_best/` points to the checkpoint with the lowest validation loss among the saved checkpoints - Best-checkpoint tracking is based on `val_metrics["loss_epoch"]` returned from the function `val_epoch(...)` - Resume behavior still loads the latest numbered checkpoint and ignores `checkpoint_best/` ## Notes - The first epoch checkpoint is intentionally saved as part of the current behavior. This provides a a guaranteed recovery point and early verification artifact. - `checkpoint_best/` is implemented as a symlink to avoid duplicating checkpoint files ## Testing Tested with: - Unit tests for checkpoint selection and symlink behavior. The test file is located in `tests/unit/train/test_checkpoint.py` - Did testing to observe the file structure of the checkpointing via `scripts/train.py` using: ```bash CUDA_VISIBLE_DEVICES=1 python scripts/train.py \ --verifier-name-or-path meta-llama/Llama-3.1-8B-Instruct \ --data-path output \ --save-path checkpoints_smoke \ --log-dir logs_smoke \ --epochs 2 \ --lr 3e-5 \ --total-seq-len 256 \ --data-format-version 1 \ --num-layers 1 \ --d2t-path vocab_mapping/d2t.npy \ --t2d-path vocab_mapping/t2d.npy \ --num-workers 1 \ --prefetch-factor 2 \ --checkpoint-freq 2 \ --save-best ``` - manual verification of checkpoint directory structure and checkpoint_best/ symlink target - confirmed backward compatibility by running without `--checkpoint-freq` and `--save-best` (current behavior maintained) ## Style/Lint Notes - On running `make style` the program suggests to make modification at `src/speculators/train/trainer.py:204:13`. Since this is unrelated to my modifications, I have not updated it to satisfy `make style`. - `make style` also suggests to make import blocks sorted in `src/speculators/train/checkpointer.py:1:1`. However, since the suggestion is also present for the current version of the repo, I have not made any changes to the arrangement of the import block. --------- Signed-off-by: surojitiitg@gmail.com <surojitiitg@gmail.com> Co-authored-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR improves checkpoint saving behavior during training by adding configurable checkpoint frequency and stable best-checkpoint tracking.
Changes
src/speculators/train/trainer.py,src/speculators/train/checkpointer.py,scripts/train.pytests/unit/train/test_checkpoint.py--checkpoint-freqto control how often numbered checkpoints are saved--save-bestto maintain a stablecheckpoint_best/pathcheckpoint_best/points to the best saved checkpoint directorycheckpoint_freq=1src/speculators/train/trainer.pyso thatval_epoch(...)returns the computedval_metricsdictionary to the training looploss_epochfrom the dictionary returned byself.val_epoch(epoch)as the monitored metric. The implementation is carried out insrc/speculators/train/trainer.pytests/unit/train/test_checkpoint.py, which contains unit tests for:checkpoint_bestduring resume checkpoint discoveryBehavior
--checkpoint-freqand--save-bestare not provided, the current checkpointing behavior is preserved (backward compatible defaults).checkpoint_freq=1preserves the current behavior of saving checkpoints every epochsave_best=True,checkpoint_best/points to the checkpoint with the lowest validation loss among the saved checkpointsval_metrics["loss_epoch"]returned from the functionval_epoch(...)checkpoint_best/Notes
checkpoint_best/is implemented as a symlink to avoid duplicating checkpoint filesTesting
Tested with:
tests/unit/train/test_checkpoint.pyscripts/train.pyusing:--checkpoint-freqand--save-best(current behavior maintained)Style/Lint Notes
make stylethe program suggests to make modification atsrc/speculators/train/trainer.py:204:13. Since this is unrelated to my modifications, I have not updated it to satisfymake style.make stylealso suggests to make import blocks sorted insrc/speculators/train/checkpointer.py:1:1. However, since the suggestion is also present for the current version of the repo, I have not made any changes to the arrangement of the import block.