microgpt.c

The most atomic way to train and run a GPT language model — in pure, dependency-free C.

A faithful port of Andrej Karpathy's micro-gpt.py (author), which is itself a minimal GPT implementation using a scalar autograd engine. The C version replaces the autograd with explicit forward/backward passes and manual gradient computation — the standard approach in production C/CUDA ML code.

What is this?

A single-file GPT trainer and text generator that fits in ~900 lines of C with zero external dependencies. It implements:

Transformer architecture: token & positional embeddings, multi-head causal self-attention, RMSNorm, squared ReLU MLP, residual connections
Training: full forward/backward pass, cross-entropy loss, Adam optimizer with linear LR warmdown
Inference: autoregressive generation with temperature sampling and KV cache

The model differences from GPT-2 are minor: LayerNorm → RMSNorm, no biases, GeLU → squared ReLU, no weight tying.

Quick Start

gcc -O2 -o microgpt microgpt.c -lm
./microgpt --num-steps 500

That's it. On first run it reads input.txt (one document per line — the included file has 32K baby names from ssa.gov). After 500 steps (~0.03s) it generates:

sample 1: kama
sample 2: kaena
sample 3: talen
sample 4: mazen
sample 5: delan
sample 6: ajoran
...

Options

Flag	Default	Description
`--n-embd`	16	Embedding dimension
`--n-layer`	1	Number of transformer layers
`--block-size`	8	Maximum sequence length
`--num-steps`	500	Training steps
`--n-head`	4	Number of attention heads
`--learning-rate`	0.01	Learning rate

Bigger model example:

./microgpt --n-embd 32 --n-layer 2 --n-head 4 --block-size 16 --num-steps 2000

Benchmarks

All benchmarks on MacBook Pro, Apple M1 Pro (8-core: 6P+2E, 16GB RAM, macOS 15.5) — default config (n_embd=16, n_layer=1, n_head=4, block_size=8, 500 steps).

C vs Python — 500 steps, 32K names dataset

	Python (autograd)	C (manual grad)
Training time	296.68s	0.02s
Mean loss (last 50)	2.4854	2.5090
Speedup	1×	~14,800×
Params	4,064	4,064
Vocab size	27	27
Dependencies	Python 3	libc + libm

~14,800× faster. The speedup comes from eliminating the Python autograd overhead — no Value object allocations, no computation graph construction, no topological sort for backward pass — just flat array math.

Scaling up — 2000 steps, bigger config (n_embd=32, n_layer=2, block_size=16)


Params	26,816
Training time	2.37s
Mean loss (last 50)	2.31
Samples	kallan, jamion, kyeran, kayna, brish, jauly

Results

microgpt.c — default config (500 steps)

vocab size: 27, num docs: 32033
num params: 4064
step    1 /  500 | loss 3.2769
step   50 /  500 | loss 2.4198
step  100 /  500 | loss 2.9955
step  250 /  500 | loss 1.9190
step  500 /  500 | loss 1.9742
mean loss last 50 steps: 2.5090
training time: 0.02s

--- inference ---
sample 1: kama
sample 2: kaena
sample 3: talen
sample 4: mazen
sample 5: delan
sample 6: ajoran
sample 7: avirie
sample 8: rarya
sample 9: anri
sample 10: mela
sample 11: shin
sample 12: nirele
sample 13: bisan
sample 14: tali
sample 15: rarya
sample 16: jama
sample 17: anicia
sample 18: tala
sample 19: yale
sample 20: amilely

microgpt.py — default config (500 steps)

vocab size: 27, num docs: 32033
num params: 4064
step    1 /  500 | loss 3.2627
step  500 /  500 | loss 2.3133
mean loss last 50 steps: 2.4854
training time: 296.68s

How it works

The Python original uses a scalar autograd engine (the Value class) that builds a dynamic computation graph for every operation, then walks it in reverse topological order to compute gradients. This is elegant and educational but extremely slow — every float becomes a heap-allocated object with pointers, closures, and graph metadata.

The C port replaces this with explicit forward and backward passes:

Parameters live in a single flat float[] array, addressed by precomputed offsets
Gradients live in a parallel float[] array
Forward pass computes activations and caches intermediate values needed for backprop
Backward pass manually implements the chain rule for each operation (linear, rmsnorm, softmax, squared ReLU)
KV cache stores keys/values for causal attention across time steps

This is the same approach used by llm.c, llama.cpp, and other production C/CUDA ML implementations.

References

micro-gpt.py — The original Python implementation by Andrej Karpathy
llm.c — LLM training in simple, raw C/CUDA
llama.cpp — LLM inference in C/C++
makemore — The names dataset source

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
input.txt		input.txt
microgpt.c		microgpt.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

microgpt.c

What is this?

Quick Start

Options

Benchmarks

C vs Python — 500 steps, 32K names dataset

Scaling up — 2000 steps, bigger config (n_embd=32, n_layer=2, block_size=16)

Results

microgpt.c — default config (500 steps)

microgpt.py — default config (500 steps)

How it works

References

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors 1

Languages

Folders and files

Latest commit

History

Repository files navigation

microgpt.c

What is this?

Quick Start

Options

Benchmarks

C vs Python — 500 steps, 32K names dataset

Scaling up — 2000 steps, bigger config (n_embd=32, n_layer=2, block_size=16)

Results

microgpt.c — default config (500 steps)

microgpt.py — default config (500 steps)

How it works

References

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors 1

Languages

Packages