Add core dependency on stable torchdata #2408
Merged
What?
See title
Why?
We already have a few dev recipes that require torchdata support (link, link). We wanted to properly test these before adding a new dependency, since these recipes are nice-to-haves, not necessarily requirements for most post-training work. However, we aim to support step-based checkpointing soon (see here), which necessitates the ability to resume training from a step within an epoch. For this, we need a dataloader that maintains state, which is already available and tested in torchdata: `StatefulDataLoader`.
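To make the "maintains state" point concrete, here is a minimal sketch of mid-epoch save/resume with `StatefulDataLoader` from `torchdata.stateful_dataloader` (its `state_dict()`/`load_state_dict()` methods are real API; the toy dataset and batch size are illustrative, not torchtune recipe code):

```python
# Minimal sketch of mid-epoch save/resume with StatefulDataLoader.
# The toy dataset and batch size are placeholders, not torchtune recipe code.
from torchdata.stateful_dataloader import StatefulDataLoader

dataset = list(range(10))       # any map-style dataset works here
loader = StatefulDataLoader(dataset, batch_size=2)

it = iter(loader)
next(it)                        # consume one batch, as if training stopped mid-epoch
snapshot = loader.state_dict()  # capture dataloader progress at this step

# Later (e.g. after a job restart): restore and continue from the next batch.
resumed = StatefulDataLoader(dataset, batch_size=2)
resumed.load_state_dict(snapshot)
for batch in resumed:
    print(batch)                # yields only the remaining batches of the epoch
```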
FAQs?
New features of immediate interest to torchtune land in torchdata fairly slowly, so I don't think we'll need nightly support that often. In addition, the current way we handle PyTorch deps is a little cumbersome: the user has to install torch, torchao, and torchvision themselves before installing the package itself. Adding torchdata on top of that for a required part of each recipe would be too much IMO. We should look at a future fix for this, such as building a torchtune[nightly] package that automatically pulls in the latest nightlies of the PyTorch packages. Not sure how feasible this is.
Currently, torchdata releases on the same cadence as PyTorch core. A faster release schedule would definitely be better for torchtune, especially if we only depend on the stable torchdata package. cc @ramanishsingh
For one, having torchdata successfully integrated into torchtune would mean we can more easily integrate multi-dataset support, iterable datasets, and multi-threaded sample packing. So it's not just one feature. But also, `StatefulDataLoader` makes our lives so much easier for step-based checkpointing and is much better tested than anything we would build, so yes, I do think it's worth it. cc @scotts @ebsmothers @pbontrager @divyanshk
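As a hedged sketch of why this helps step-based checkpointing: the dataloader's state composes with model and optimizer state in a single checkpoint, which is what allows resuming from an arbitrary step within an epoch. The tiny model, optimizer, and checkpoint layout below are illustrative assumptions, not torchtune's actual recipe or checkpoint format:

```python
# Illustrative sketch: dataloader state saved alongside model/optimizer state.
# The model, optimizer, and checkpoint layout are assumptions for this example.
import torch
from torch import nn
from torchdata.stateful_dataloader import StatefulDataLoader

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = StatefulDataLoader(torch.randn(8, 4), batch_size=2)

it = iter(loader)
batch = next(it)          # one training step before checkpointing mid-epoch
loss = model(batch).sum()
loss.backward()
optimizer.step()

torch.save(
    {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "dataloader": loader.state_dict(),  # mid-epoch position included
    },
    "step_checkpoint.pt",
)
```

On resume, each component's `load_state_dict` restores its piece, so the run picks up at the exact batch where it left off rather than restarting the epoch.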