- Use a laptop to simulate 4-node training of an image classification model using DiLoCo
- Use a single node with 4x 4090 GPUs to simulate 16-node training of a language model using SPARTA
- Simulate distributed training without setting up a real cluster - no Kubernetes, Docker, or GPU hosting required
- Fast iteration: implementing a new distributed training algorithm from scratch takes as few as 5 lines of code
- Scale up the number of nodes by changing a single parameter
- Switch hardware from a laptop to a multi-GPU node - with no code changes
EXO Gym spins up multiple virtual PyTorch nodes on the hardware available. The virtual nodes train in parallel across the devices, and can communicate with PyTorch primitives such as all_reduce.
... and anything else you can imagine! Implementing new algorithms with EXO Gym is very simple - see Custom Algorithms.
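To give a flavour of the programming model, here is a minimal sketch of the simplest possible strategy - plain gradient averaging across the virtual nodes via all_reduce. The import paths below are assumptions, and the attributes used (self.model, self.num_nodes, self.optim) mirror the custom-algorithm example further down:

from torch.distributed import all_reduce  # PyTorch collective primitive; EXO Gym's virtual nodes can communicate with it

from exogym.strategy import Strategy  # assumed import path for the Strategy base class

class AllReduceStrategy(Strategy):
    def __init__(self, optim_spec):
        super().__init__()
        self.optim_spec = optim_spec  # the base Strategy is assumed to build self.optim from this spec

    def step(self):
        # Average gradients across every virtual node, DDP-style
        for param in self.model.parameters():
            if param.grad is not None:
                all_reduce(param.grad)        # sum this gradient over all virtual nodes
                param.grad /= self.num_nodes  # turn the sum into a mean
        self.optim.step()
        super().step()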
python>=3.10
To install:
git clone https://github.com/exo-explore/gym.git exogym
cd exogym
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
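To sanity-check the install (the import path matches the example below):

python -c "from exogym import Trainer; print('EXO Gym imported successfully')"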
Strategies (e.g. DiLoCo, SPARTA) are portable across domains. A custom dataset and model can be trained with a distributed algorithm like so:

from exogym import Trainer
from exogym.strategy.diloco import DiLoCoStrategy
train_dataset, val_dataset = ...
model = ... # model.forward() expects a batch, and returns a scalar loss
trainer = Trainer(model, train_dataset, val_dataset)
# Strategy for optimization & communication
strategy = DiLoCoStrategy(
    inner_optim='adam',
    H=100
)
trainer.fit(
    strategy=strategy,
    num_nodes=4,
    device='mps'
)
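Scaling out or switching hardware needs no code changes beyond the arguments to fit(). For example, to simulate 16 nodes on a CUDA machine:

trainer.fit(
    strategy=strategy,
    num_nodes=16,   # simulate 16 virtual nodes instead of 4
    device='cuda'   # e.g. a multi-GPU node instead of a laptop
)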
example/playground.py is a minimal starting point for writing new algorithms. For example, to implement gradient quantization from scratch:

from typing import Literal

import torch
from torch.distributed import all_reduce  # PyTorch collective primitive; EXO Gym's virtual nodes can communicate with it

from exogym.strategy import Strategy  # assumed import path for the Strategy base class

class QuantizationStrategy(Strategy):
    def __init__(self, optim_spec, quantization_level: Literal['int8']):
        super().__init__()
        self.optim_spec = optim_spec  # the base Strategy is assumed to build self.optim from this spec
        # Fixed affine-quantization parameters for this toy example
        self.scale = 0.024
        self.zero_point = 0
        self.qdtype = torch.uint8

    def step(self):
        for param in self.model.parameters():
            if param.grad is not None:
                # Affine-quantize the gradient to uint8 (zero_point is 0 here, so no offset correction is needed when dequantizing)
                quantized = torch.round(param.grad / self.scale + self.zero_point).clamp(0, 255).to(self.qdtype)
                # Widen before the collective so the sum across nodes cannot overflow uint8
                q_wide = quantized.to(torch.int32)
                all_reduce(q_wide)
                # Dequantize and average the summed gradient over the virtual nodes
                param.grad = (q_wide.to(torch.float32) * self.scale) / self.num_nodes
        self.optim.step()
        super().step()

EXO Gym supports the following device backends:

- CPU
- CUDA
- MPS (CPU-bound for copy operations)
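A custom strategy such as the QuantizationStrategy above plugs into the same Trainer API on any of these backends. A minimal sketch (passing 'adam' as the optim_spec is an assumption, mirroring inner_optim='adam' in the DiLoCo example):

trainer = Trainer(model, train_dataset, val_dataset)
trainer.fit(
    strategy=QuantizationStrategy(optim_spec='adam', quantization_level='int8'),
    num_nodes=4,
    device='cuda'  # or 'cpu' / 'mps'
)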
For further details on how EXO Gym works under the hood, please see docs/.
If you use EXO Gym in your research, please cite:
@software{exogym2025,
  title={EXO Gym},
  author={Matt Beton and Mohamed Baioumy and Matt Reed and Seth Howes and Alex Cheema},
  year={2025},
  url={https://github.com/exo-explore/gym}
}



