Description
The Muon optimizer has been shown to be efficient, potentially outpacing AdamW for LLM training. To quote Essential AI: “Muon requires 10–15% fewer tokens than AdamW to reach an identical loss and converts these savings into faster wall-clock convergence, with the advantage staying constant or growing as the batch size increases… These results establish Muon as a drop-in successor to AdamW for second-order optimization at scale.”
We'd love to accept a contribution of a canonical example of Muon in the torchtune library, specifically for our full SFT recipes (single-device and multi-GPU).
Artifacts
- An implementation of the Muon optimizer as a PyTorch `Optimizer` (a minimal sketch follows this list)
- Any changes needed in the recipes to support the new optimizer across our feature set
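
For reference, a minimal sketch of what the first artifact could look like, not torchtune's API or a settled design. It assumes the quintic Newton-Schulz coefficients from Keller Jordan's open-source Muon implementation; the class name `Muon`, the hyperparameter defaults, and the shape-based update scaling are illustrative choices a contributor would refine.

```python
import torch


def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration.

    The iteration pushes the singular values of G toward 1 without an explicit SVD.
    Coefficients follow Keller Jordan's reference Muon implementation.
    """
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)           # keep the spectral norm at most ~1
    transposed = G.size(0) > G.size(1)
    if transposed:                     # iterate on the wide orientation for cheaper matmuls
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X


class Muon(torch.optim.Optimizer):
    """Minimal Muon: SGD with momentum whose update is orthogonalized per weight matrix.

    Intended for 2D hidden weight matrices; embeddings, norms, biases, and the
    output head are usually kept on AdamW.
    """

    def __init__(self, params, lr=0.02, momentum=0.95, nesterov=True, ns_steps=5):
        defaults = dict(lr=lr, momentum=momentum, nesterov=nesterov, ns_steps=ns_steps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if "momentum_buffer" not in state:
                    state["momentum_buffer"] = torch.zeros_like(g)
                buf = state["momentum_buffer"]
                buf.mul_(group["momentum"]).add_(g)
                g = g.add(buf, alpha=group["momentum"]) if group["nesterov"] else buf
                # Flatten any higher-dimensional parameter to 2D before orthogonalizing.
                g2d = g.reshape(g.size(0), -1)
                update = zeropower_via_newtonschulz5(g2d, steps=group["ns_steps"])
                # One common scaling so the update RMS is roughly shape-independent.
                update = update * max(1.0, g2d.size(0) / g2d.size(1)) ** 0.5
                p.add_(update.reshape_as(p), alpha=-group["lr"])
        return loss
```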
Acceptance Criteria
- Clean, well-documented code with proper citations
- Tests
- Logs comparing Muon to AdamW for text training
- Logs comparing Muon to AdamW for multimodal (image + text) training (the sketch after this list shows one way such comparisons could be generated)
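
As a hedged illustration of how comparable loss logs could be produced, here is a tiny side-by-side harness that reuses the `Muon` sketch above. The toy model, synthetic data, learning rates, and the choice to keep non-matrix parameters on AdamW are placeholder assumptions; the real comparison logs should come from the actual torchtune SFT recipes.

```python
import torch
import torch.nn as nn


def train(optimizer_name: str, steps: int = 200, seed: int = 0) -> None:
    torch.manual_seed(seed)
    model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))

    # Muon is typically applied only to 2D weight matrices; biases (and, in a
    # real LLM, embeddings/norms/output head) stay on AdamW.
    matrix_params = [p for p in model.parameters() if p.ndim == 2]
    other_params = [p for p in model.parameters() if p.ndim != 2]
    if optimizer_name == "muon":
        optimizers = [Muon(matrix_params, lr=0.02),
                      torch.optim.AdamW(other_params, lr=1e-3)]
    else:
        optimizers = [torch.optim.AdamW(model.parameters(), lr=1e-3)]

    # Fixed synthetic batch: enough to compare optimization speed, not model quality.
    x = torch.randn(512, 64)
    y = torch.randint(0, 10, (512,))
    loss_fn = nn.CrossEntropyLoss()

    for step in range(steps):
        loss = loss_fn(model(x), y)
        for opt in optimizers:
            opt.zero_grad()
        loss.backward()
        for opt in optimizers:
            opt.step()
        if step % 50 == 0:
            print(f"{optimizer_name:5s} step {step:3d} | loss {loss.item():.4f}")


train("adamw")
train("muon")
```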
Resources