
Integrate Muon optimizer #2725


Description

@joecummings

The Muon optimizer has been shown to be efficient, potentially outpacing AdamW for LLM training. To quote Essential AI: “Muon requires 10–15% fewer tokens than AdamW to reach an identical loss and converts these savings into faster wall-clock convergence, with the advantage staying constant or growing as the batch size increases… These results establish Muon as a drop-in successor to AdamW for second-order optimization at scale.”

We'd love to accept a contribution of a canonical example of Muon in the torchtune library, specifically for our full SFT recipes (single-device and multi-GPU).

Artifacts

  • An implementation of the Muon optimizer as a PyTorch Optimizer (a minimal sketch follows this list)
  • Any changes needed in the recipes to support Muon across our feature set
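
For concreteness, here is a minimal, single-device sketch of the kind of torch.optim.Optimizer subclass this issue asks for. It is not an official reference: the class name, hyperparameter defaults, Newton-Schulz coefficients, and the RMS-matching scale factor follow commonly cited Muon implementations and should be treated as assumptions. A real contribution would also need to handle parameter grouping and distributed training for the multi-GPU recipe.

```python
# A minimal single-device sketch of what the requested optimizer could look like.
# Defaults, Newton-Schulz coefficients, and the RMS-matching scale factor follow
# commonly cited Muon reference implementations and are assumptions, not torchtune API.
# Muon is intended for >=2D weight matrices; embeddings, norms, and biases
# would typically stay on AdamW.
import torch


def _newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients used in reference implementations
    X = G.to(torch.bfloat16)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    X = X / (X.norm() + 1e-7)  # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)


class Muon(torch.optim.Optimizer):
    """SGD with momentum whose update is orthogonalized before being applied."""

    def __init__(self, params, lr: float = 0.02, momentum: float = 0.95, ns_steps: int = 5):
        defaults = dict(lr=lr, momentum=momentum, ns_steps=ns_steps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if "momentum_buffer" not in state:
                    state["momentum_buffer"] = torch.zeros_like(g)
                buf = state["momentum_buffer"]
                buf.mul_(group["momentum"]).add_(g)
                g = g.add(buf, alpha=group["momentum"])  # Nesterov-style lookahead
                g2d = g.reshape(g.size(0), -1)  # treat higher-rank params as 2D
                update = _newton_schulz_orthogonalize(g2d, steps=group["ns_steps"])
                # Scale so the per-parameter update RMS is comparable to AdamW's
                scale = max(1.0, g2d.size(0) / g2d.size(1)) ** 0.5
                p.add_(update.reshape_as(p), alpha=-group["lr"] * scale)
        return loss
```

In a recipe this would typically be paired with AdamW: pass only the 2D transformer weight matrices to Muon and keep embeddings, norms, and biases on AdamW, which is also how most published Muon-vs-AdamW comparisons are run.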

Acceptance Criteria

  • Clean, well-documented code with proper citations
  • Tests
  • Logs comparing Muon to AdamW for text training
  • Logs comparing Muon to AdamW for multimodal (image + text) training

Resources

Labels

  • community help wanted: We would love the community's help completing this issue
  • enhancement: New feature or request
