
Integrate Muon optimizer #2725


Description

@joecummings

The Muon optimizer has been shown to be efficient, potentially outpacing AdamW for LLM training. To quote Essential AI: “Muon requires 10–15% fewer tokens than AdamW to reach an identical loss and converts these savings into faster wall-clock convergence, with the advantage staying constant or growing as the batch size increases… These results establish Muon as a drop-in successor to AdamW for second-order optimization at scale.”

We'd love to accept a contribution of a canonical example of Muon in the torchtune library, specifically for our full SFT recipes (single-device and multi-GPU).

Artifacts

  • An implementation of the Muon optimizer as a PyTorch Optimizer (a minimal sketch follows this list)
  • Any changes needed in the recipes to support Muon across our feature set
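
For concreteness, here is a minimal, single-device sketch of the kind of torch.optim.Optimizer subclass this issue asks for. It is not an official reference: the class name, hyperparameter defaults, Newton-Schulz coefficients, and the RMS-matching scale factor follow commonly cited Muon implementations and should be treated as assumptions. A real contribution would also need to handle parameter grouping and distributed training for the multi-GPU recipe.

```python
# A minimal single-device sketch of what the requested optimizer could look like.
# Defaults, Newton-Schulz coefficients, and the RMS-matching scale factor follow
# commonly cited Muon reference implementations and are assumptions, not torchtune API.
# Muon is intended for >=2D weight matrices; embeddings, norms, and biases
# would typically stay on AdamW.
import torch


def _newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients used in reference implementations
    X = G.to(torch.bfloat16)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    X = X / (X.norm() + 1e-7)  # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)


class Muon(torch.optim.Optimizer):
    """SGD with momentum whose update is orthogonalized before being applied."""

    def __init__(self, params, lr: float = 0.02, momentum: float = 0.95, ns_steps: int = 5):
        defaults = dict(lr=lr, momentum=momentum, ns_steps=ns_steps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if "momentum_buffer" not in state:
                    state["momentum_buffer"] = torch.zeros_like(g)
                buf = state["momentum_buffer"]
                buf.mul_(group["momentum"]).add_(g)
                g = g.add(buf, alpha=group["momentum"])  # Nesterov-style lookahead
                g2d = g.reshape(g.size(0), -1)  # treat higher-rank params as 2D
                update = _newton_schulz_orthogonalize(g2d, steps=group["ns_steps"])
                # Scale so the per-parameter update RMS is comparable to AdamW's
                scale = max(1.0, g2d.size(0) / g2d.size(1)) ** 0.5
                p.add_(update.reshape_as(p), alpha=-group["lr"] * scale)
        return loss
```

In a recipe this would typically be paired with AdamW: pass only the 2D transformer weight matrices to Muon and keep embeddings, norms, and biases on AdamW, which is also how most published Muon-vs-AdamW comparisons are run.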

Acceptance Criteria

  • Clean, well-documented code with proper citations
  • Tests
  • Logs comparing Muon to AdamW for text training
  • Logs comparing Muon to AdamW for multimodal (image + text) training

Resources

Labels

  • community help wanted: We would love the community's help completing this issue
  • enhancement: New feature or request
