Conversation

joecummings (Member) commented on Feb 24, 2025

What?
This expands the use of StatefulDataLoader to the following four recipes:

  • Full distributed
  • LoRA single device
  • LoRA distributed
  • GRPO

What did you change?
I copied essentially the same changes I made in #2410 into the above four recipes (a simplified sketch of the pattern follows below). I modified the tests for the single-device recipes because we are no longer relying on a different state. For the distributed recipes, I did not modify the tests at all. I also had to change the hack we used to ensure the iterator finishes even if we cut the epoch short, because manually modifying the dataloader state dict was not robust and looked ugly.
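
In broad strokes, the propagated change looks roughly like the sketch below. This is a simplified illustration, not the literal recipe code; the function signature and argument plumbing through the configs are assumptions.

```python
from torch.utils.data import Dataset, DistributedSampler
from torchdata.stateful_dataloader import StatefulDataLoader


# Simplified sketch of the propagated change: build a StatefulDataLoader instead of a
# plain DataLoader so the loader's position can be checkpointed and restored.
def setup_data(ds: Dataset, batch_size: int, world_size: int, rank: int):
    sampler = DistributedSampler(ds, num_replicas=world_size, rank=rank, shuffle=True)
    dataloader = StatefulDataLoader(
        ds,
        batch_size=batch_size,
        sampler=sampler,
        drop_last=True,
    )
    return sampler, dataloader


# At checkpoint time the recipe saves dataloader.state_dict(); on resume it calls
# dataloader.load_state_dict(...) so iteration picks up where the previous run left off.
```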

Why these recipes?
These include our most stable recipes plus our newest one, GRPO. By looking at the changes made here, users will be able to easily propagate them to everything else in the library. I will create an issue to track this for the rest of the recipes.

How did you test GRPO?
Good question, because GRPO has no standardized tests in the torchtune library. It's also an interesting case because it does not follow the same format as our other recipes. For instance, there is no option to cut an epoch short based on the number of steps in an individual epoch (a strange variable anyway). However, there is an option to stop after a total number of training steps is reached. But that counter is not included in the saved state dict, so there is no way to reasonably resume training for GRPO without actually implementing step-based checkpointing. That part is a wash for now.
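
For illustration only (none of this is in the PR), here is a minimal sketch of what step-based resume could look like, assuming a StatefulDataLoader and hypothetical key and function names:

```python
from torchdata.stateful_dataloader import StatefulDataLoader


# Hypothetical sketch: persist enough state to resume a step-capped run like GRPO.
def save_training_state(dataloader: StatefulDataLoader, global_step: int) -> dict:
    return {
        "global_step": global_step,                   # total training steps completed so far
        "dataloader_state": dataloader.state_dict(),  # position within the current epoch
    }


def load_training_state(dataloader: StatefulDataLoader, state: dict) -> int:
    dataloader.load_state_dict(state["dataloader_state"])
    # The recipe would then keep training until its configured total number of steps,
    # starting from the restored count instead of zero.
    return state["global_step"]
```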


pytorch-bot bot commented Feb 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2431

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 634be31 with merge base 7b654ea:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label on Feb 24, 2025
joecummings marked this pull request as ready for review on February 26, 2025 at 16:52
joecummings changed the title from "Add StatefulDataLoader to rest of recipes" to "Add StatefulDataLoader to select other recipes" on Feb 26, 2025
```python
if dataloader_state_dict is not None:
    dataloader.load_state_dict(dataloader_state_dict)
    # Hack to force dataloader to finish last iteration
    list(dataloader)

return sampler, dataloader
```

Thanks for including StatefulDataLoader in these recipes.
I have a couple of comments.

  1. Let's add a more descriptive comment, something like: "Since we break early without completing the last epoch, and we want to start a new epoch when restarting training, we need to yield the remaining batches from the last epoch before restarting." It doesn't need to be this long, but something along these lines, since someone who doesn't work with StatefulDataLoader might get confused here.
  2. Do you think we need a flag (like finish_current_epoch, or some other name) that we set when we break early from an epoch, so that only when that flag is True do we complete the epoch before restarting training? (A rough sketch of this idea follows below.)
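
A rough sketch of that flag idea, with hypothetical names (nothing here is committed in this PR):

```python
from typing import Optional

from torchdata.stateful_dataloader import StatefulDataLoader


# Hypothetical sketch of suggestion 2; finish_current_epoch would be saved by the recipe
# whenever it breaks out of an epoch early (e.g. max_steps_per_epoch was hit).
def restore_dataloader(
    dataloader: StatefulDataLoader,
    dataloader_state_dict: Optional[dict],
    finish_current_epoch: bool,
) -> None:
    if dataloader_state_dict is None:
        return
    dataloader.load_state_dict(dataloader_state_dict)
    if finish_current_epoch:
        # Drain the leftover batches only if the previous run stopped mid-epoch,
        # so the resumed run starts at a clean epoch boundary.
        list(dataloader)
```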

Reply from joecummings (Member, PR author):

Re 1: Yeah, that definitely makes sense; let me add it.

Re 2: Because right now we always save at epoch boundaries, there will never be a time when we don't want to complete the epoch before restarting training. This will change very soon as we start mid-epoch checkpointing.

ramanishsingh left a comment:

Thanks for adding the comment on finishing the DL epoch upon a restart.
LGTM!

codecov-commenter commented Feb 26, 2025

Codecov Report

Attention: Patch coverage is 0% with 41 lines in your changes missing coverage. Please review.

Project coverage is 65.37%. Comparing base (cf0142b) to head (634be31).
Report is 178 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| recipes/full_finetune_distributed.py | 0.00% | 13 Missing ⚠️ |
| recipes/lora_finetune_distributed.py | 0.00% | 13 Missing ⚠️ |
| recipes/lora_finetune_single_device.py | 0.00% | 10 Missing ⚠️ |
| tests/recipes/test_lora_finetune_single_device.py | 0.00% | 2 Missing ⚠️ |
| ...htune/training/checkpointing/_checkpoint_client.py | 0.00% | 2 Missing ⚠️ |
| recipes/full_finetune_single_device.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2431      +/-   ##
==========================================
- Coverage   65.38%   65.37%   -0.02%     
==========================================
  Files         374      374              
  Lines       22172    22189      +17     
==========================================
+ Hits        14498    14505       +7     
- Misses       7674     7684      +10     

☔ View full report in Codecov by Sentry.

joecummings merged commit 4d9840c into meta-pytorch:main on Feb 26, 2025
17 checks passed
joecummings deleted the scatter-stateful-dl branch on February 26, 2025 at 19:39
joecummings added a commit to joecummings/torchtune that referenced this pull request Feb 27, 2025
joecummings added a commit to joecummings/torchtune that referenced this pull request Feb 27, 2025
pbontrager pushed a commit to pbontrager/torchtune that referenced this pull request Mar 17, 2025