A mixture of experts (MoE) designed for vision transformers and diffusion transformers.
diffmoe uses a batch pool of tokens: experts are selected across an entire batch of tokens at once. This contrasts with token-choice MoE, where each token independently selects its activated expert. The batch pool comes at a cost, since expert selection now depends on batch statistics at each forward pass, so routing at evaluation time can diverge from what was seen during training. To resolve this, diffmoe trains a capacity predictor so that evaluation-time expert selection mimics the selection dynamics of training.

The payoff is dynamic compute allocation. Information is spread unevenly across image patches: some patches carry vital signal and should receive extra expert compute, while others are largely uninformative. diffmoe's batch pool allows allocating more capacity to the complex samples and tokens.
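To make the routing concrete, below is a minimal, illustrative PyTorch sketch of batch-pool (expert-choice) routing plus an auxiliary predictor trained to imitate the training-time assignments, so that inference no longer depends on the rest of the batch. The class name `BatchPoolRouter`, the linear `capacity_predictor`, the sigmoid-threshold decision rule, and all hyperparameters are assumptions for illustration, not the actual diffmoe implementation; see the repository source for the real one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BatchPoolRouter(nn.Module):
    """Expert-choice routing over a pooled batch of tokens (illustrative sketch).

    Training: each expert takes its top-k tokens, where k is derived from the
    size of the whole batch pool, so selection depends on batch statistics.
    Evaluation: a small capacity predictor imitates the training-time
    assignments, so routing no longer depends on the rest of the batch.
    """

    def __init__(self, dim: int, num_experts: int, capacity_factor: float = 1.0):
        super().__init__()
        self.num_experts = num_experts
        self.capacity_factor = capacity_factor
        self.gate = nn.Linear(dim, num_experts)
        # Assumption: a per-token head trained to reproduce the batch-pool
        # assignments; the real diffmoe predictor may be structured differently.
        self.capacity_predictor = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq, dim) -> flatten every token in the batch into one pool
        tokens = x.reshape(-1, x.shape[-1])                        # (N, dim)
        scores = self.gate(tokens).softmax(dim=-1)                 # (N, E)

        if self.training:
            n_tokens = tokens.shape[0]
            k = max(1, int(self.capacity_factor * n_tokens / self.num_experts))
            # Each expert picks its top-k tokens from the pool (expert choice).
            topk_scores, topk_idx = scores.t().topk(k, dim=-1)     # (E, k)
            dispatch = torch.zeros(self.num_experts, n_tokens, device=x.device)
            dispatch.scatter_(1, topk_idx, 1.0)                    # (E, N)
            # Train the predictor to imitate the batch-pool assignments.
            pred_logits = self.capacity_predictor(tokens)          # (N, E)
            aux_loss = F.binary_cross_entropy_with_logits(pred_logits.t(), dispatch)
            return dispatch, scores, aux_loss

        # Inference: per-token decision from the predictor, independent of
        # how many or which other samples are in the batch.
        keep = torch.sigmoid(self.capacity_predictor(tokens)) > 0.5   # (N, E)
        dispatch = keep.t().float()                                    # (E, N)
        return dispatch, scores, torch.zeros((), device=x.device)


# Toy usage: route an 8-image batch of 64 tokens each across 4 experts.
router = BatchPoolRouter(dim=256, num_experts=4)
x = torch.randn(8, 64, 256)
dispatch, scores, aux_loss = router(x)   # dispatch: (4, 512) 0/1 assignment mask
```

The auxiliary loss here is one simple way to tie the predictor to the training-time routing; whatever form it takes, the key property is that evaluation-time dispatch is computed per token and does not depend on batch composition.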
An example of training a DiT to generate MNIST images is in `examples/`.