"Save best" and other improvements to model checkpointing, Issue #307 (#334)

Merged
shanjiaz merged 8 commits into vllm-project:main from surojitiitg:feature/checkpoint-freq-save-best
Mar 20, 2026

Conversation

@surojitiitg

Contributor

Summary

This PR improves checkpoint saving behavior during training by adding configurable checkpoint frequency and stable best-checkpoint tracking.

Changes

  • Files modified: src/speculators/train/trainer.py, src/speculators/train/checkpointer.py, scripts/train.py
  • Files added: tests/unit/train/test_checkpoint.py
  • Added --checkpoint-freq to control how often numbered checkpoints are saved
  • Added --save-best to maintain a stable checkpoint_best/ path
  • Implemented symlink support so checkpoint_best/ points to the best saved checkpoint directory
  • Kept default behavior backward compatible by using checkpoint_freq=1
  • Updated validation flow in src/speculators/train/trainer.py so that val_epoch(...) returns the computed val_metrics dictionary to the training loop
  • Best-checkpoint selection monitors the loss_epoch key of the dictionary returned by self.val_epoch(epoch). This logic is implemented in src/speculators/train/trainer.py
  • Added tests/unit/train/test_checkpoint.py, which contains unit tests for:
    • ignoring checkpoint_best during resume checkpoint discovery
    • creating and updating the best-checkpoint symlink
    • checkpoint selection behavior during training
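
The symlink handling listed above could be sketched roughly as follows. This is a minimal illustration and not the code from src/speculators/train/checkpointer.py: the helper name update_best_symlink, the temporary link name, and the use of a relative link target are all assumptions made for the example, and it assumes a POSIX filesystem.

```python
import os
from pathlib import Path


def update_best_symlink(
    save_dir: Path, best_checkpoint: Path, link_name: str = "checkpoint_best"
) -> Path:
    """Point save_dir/link_name at best_checkpoint (hypothetical helper)."""
    link = save_dir / link_name
    tmp_link = save_dir / f".{link_name}.tmp"

    # Remove a stale temporary link left behind by an interrupted run.
    if tmp_link.is_symlink() or tmp_link.exists():
        tmp_link.unlink()

    # A relative target keeps the link valid if the whole save directory is moved.
    target = os.path.relpath(best_checkpoint, start=save_dir)
    os.symlink(target, tmp_link, target_is_directory=True)

    # Renaming the temporary link over the old one swaps the target in one step
    # on POSIX, so readers never observe a missing checkpoint_best path.
    os.replace(tmp_link, link)
    return link
```

Creating the new link under a temporary name and renaming it into place avoids a window in which checkpoint_best/ does not exist, which matters if another process reads the path while training is still running.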

Behavior

  • If --checkpoint-freq and --save-best are not provided, the current checkpointing behavior is preserved (backward compatible defaults).
  • checkpoint_freq=1 preserves the current behavior of saving checkpoints every epoch
  • When save_best=True, checkpoint_best/ points to the checkpoint with the lowest validation loss among the saved checkpoints
  • Best-checkpoint tracking is based on val_metrics["loss_epoch"] returned from the function val_epoch(...)
  • Resume behavior still loads the latest numbered checkpoint and ignores checkpoint_best/
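
To make the behavior above concrete, here is a small sketch of best-checkpoint tracking and resume discovery. It is an illustration under stated assumptions, not the trainer's actual code: BestCheckpointTracker and latest_numbered_checkpoint are hypothetical names, and the sketch assumes numbered checkpoints are directories whose names are plain integers. Only the loss_epoch key and the checkpoint_best name come from the PR description.

```python
import math
from pathlib import Path
from typing import Dict, Optional


class BestCheckpointTracker:
    """Tracks the lowest validation loss among saved checkpoints (illustrative)."""

    def __init__(self) -> None:
        self.best_loss = math.inf
        self.best_epoch: Optional[int] = None

    def update(self, epoch: int, val_metrics: Dict[str, float], saved: bool) -> bool:
        """Return True if this epoch's checkpoint becomes the new best."""
        # Only epochs that actually wrote a numbered checkpoint are eligible.
        if not saved:
            return False
        loss = val_metrics["loss_epoch"]
        if loss < self.best_loss:
            self.best_loss = loss
            self.best_epoch = epoch
            return True
        return False


def latest_numbered_checkpoint(save_dir: Path) -> Optional[Path]:
    """Pick the highest-numbered checkpoint directory for resuming.

    Assumes integer directory names; anything else, including
    checkpoint_best, is skipped because its name is not numeric.
    """
    numbered = [p for p in save_dir.iterdir() if p.is_dir() and p.name.isdigit()]
    return max(numbered, key=lambda p: int(p.name), default=None)
```

Note that an epoch with a lower validation loss that was not saved (because of the checkpoint frequency) cannot become the best checkpoint in this sketch, which matches the "among the saved checkpoints" wording above.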

Notes

  • The first epoch checkpoint is intentionally saved as part of the current behavior. This provides a guaranteed recovery point and an early verification artifact.
  • checkpoint_best/ is implemented as a symlink to avoid duplicating checkpoint files

Testing

Tested with:

  • Unit tests for checkpoint selection and symlink behavior. The test file is located in tests/unit/train/test_checkpoint.py
  • A smoke run of scripts/train.py to inspect the resulting checkpoint file structure, using:
CUDA_VISIBLE_DEVICES=1 python scripts/train.py \
  --verifier-name-or-path meta-llama/Llama-3.1-8B-Instruct \
  --data-path output \
  --save-path checkpoints_smoke \
  --log-dir logs_smoke \
  --epochs 2 \
  --lr 3e-5 \
  --total-seq-len 256 \
  --data-format-version 1 \
  --num-layers 1 \
  --d2t-path vocab_mapping/d2t.npy \
  --t2d-path vocab_mapping/t2d.npy \
  --num-workers 1 \
  --prefetch-factor 2 \
  --checkpoint-freq 2 \
  --save-best
  • Manual verification of the checkpoint directory structure and the checkpoint_best/ symlink target
  • Confirmed backward compatibility by running without --checkpoint-freq and --save-best (current behavior maintained)
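
The manual check of the directory structure and symlink target could be scripted along these lines. This is a hypothetical helper, not part of the PR; describe_checkpoint_dir is an invented name, and the output format is chosen only for readability.

```python
import os
from pathlib import Path
from typing import List


def describe_checkpoint_dir(save_dir: Path) -> List[str]:
    """List entries in save_dir, showing where each symlink points."""
    lines = []
    for entry in sorted(save_dir.iterdir(), key=lambda p: p.name):
        # Check for symlinks first: is_dir() follows links and would hide them.
        if entry.is_symlink():
            lines.append(f"{entry.name} -> {os.readlink(entry)}")
        elif entry.is_dir():
            lines.append(f"{entry.name}/")
        else:
            lines.append(entry.name)
    return lines
```

For the smoke run above, one might call describe_checkpoint_dir(Path("checkpoints_smoke")) and print each returned line to confirm that checkpoint_best is a link to one of the numbered checkpoint directories.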

Style/Lint Notes

  • Running make style flags src/speculators/train/trainer.py:204:13. Since this is unrelated to my changes, I have not modified it.
  • make style also suggests sorting the import block at src/speculators/train/checkpointer.py:1:1. Since the same suggestion appears on the current version of the repo, I have left the import block unchanged.

@rahul-tuli

rahul-tuli commented Mar 5, 2026

Collaborator

I think some IDE files were also added as a part of this diff, could you remove them?

@surojitiitg

Contributor Author

I think some IDE files were also added as a part of this diff, could you remove them?

@rahul-tuli thanks for catching that. I have removed them. Let me know if any further modifications are required.

@rahul-tuli rahul-tuli left a comment

Collaborator

LGTM, pending comments, great first pass!

Comment thread src/speculators/train/checkpointer.py
Comment thread scripts/train.py
Comment thread src/speculators/train/trainer.py Outdated
Comment thread src/speculators/train/trainer.py Outdated
Comment thread .gitignore Outdated
@rahul-tuli rahul-tuli requested review from dsikka, fynnsu and shanjiaz March 5, 2026 12:54
Comment thread src/speculators/train/trainer.py Outdated
@mergify

mergify Bot commented Mar 14, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @surojitiitg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Mar 14, 2026
Signed-off-by: surojitiitg@gmail.com <surojitiitg@gmail.com>
Signed-off-by: surojitiitg@gmail.com <surojitiitg@gmail.com>
Signed-off-by: surojitiitg@gmail.com <surojitiitg@gmail.com>
@surojitiitg surojitiitg force-pushed the feature/checkpoint-freq-save-best branch from d1820ef to 988cbc2 Compare March 14, 2026 21:17
@mergify mergify Bot removed the needs-rebase label Mar 14, 2026
Signed-off-by: surojitiitg@gmail.com <surojitiitg@gmail.com>
@surojitiitg surojitiitg force-pushed the feature/checkpoint-freq-save-best branch from 988cbc2 to c9aebc1 Compare March 14, 2026 21:22
Comment thread src/speculators/train/trainer.py Outdated
Comment thread src/speculators/train/trainer.py Outdated
Comment thread src/speculators/train/trainer.py Outdated
Comment thread src/speculators/train/trainer.py
@surojitiitg surojitiitg force-pushed the feature/checkpoint-freq-save-best branch from 545503e to f4024ea Compare March 17, 2026 13:31
@fynnsu

fynnsu commented Mar 17, 2026

Collaborator

@surojitiitg Please fix quality issues.
You can run make style to try to auto-fix them
and make quality to confirm that they are all passing.

Signed-off-by: surojitiitg@gmail.com <surojitiitg@gmail.com>
@surojitiitg surojitiitg force-pushed the feature/checkpoint-freq-save-best branch from f4024ea to d5e0dd7 Compare March 17, 2026 17:16
Signed-off-by: surojitiitg@gmail.com <surojitiitg@gmail.com>
@surojitiitg surojitiitg force-pushed the feature/checkpoint-freq-save-best branch from 8559129 to 7433ad5 Compare March 18, 2026 16:24

@fynnsu fynnsu left a comment

Collaborator

Looks good! Thank you for making all the updates.

@shanjiaz shanjiaz enabled auto-merge (squash) March 20, 2026 13:42
@shanjiaz shanjiaz merged commit b32ae8c into vllm-project:main Mar 20, 2026
12 checks passed
YzTongNiar pushed a commit to YzTongNiar/speculators that referenced this pull request Apr 10, 2026
…-project#307 (vllm-project#334)

Signed-off-by: surojitiitg@gmail.com <surojitiitg@gmail.com>
Co-authored-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
4 participants