A minimal pretraining trainer for LLMs, from scratch.
```bash
# For model training and the full pipeline
uv pip install 'krill[cuda]@git+https://github.com/minpeter/krill.git' --torch-backend=cu128

# For preprocessing tasks
uv pip install 'krill@git+https://github.com/minpeter/krill.git' --torch-backend=cpu
```
After installation, the CLI is available as both `krill` and the shorthand `kr`.
Krill is a minimalistic training framework for Large Language Models (LLMs) built from scratch with simplicity and flexibility in mind. It provides command-line tools and modular components to handle data preprocessing, tokenizer training, model training, inference, and dataset inspection.
- Modular CLI with commands for preprocessing, tokenizer training, model training, inference, and dataset inspection
- Support for Hugging Face Transformers, Accelerate, and Flash Attention
- Configurable via YAML files with validation using Pydantic
- Automatic environment optimizations (e.g., Flash Attention)
- Integration with Hugging Face Hub for model and tokenizer pushing
- Data collators optimized for language modeling and flash attention (a generic example follows this list)
- Customizable optimizers including Muon
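For context on the collator bullet above, the sketch below shows the stock Hugging Face causal-LM collator rather than Krill's own collators; the model ID and example sentences are placeholders, and Krill's collators additionally arrange batches to work well with Flash Attention.

```python
# Illustration only: the stock Hugging Face causal-LM collator, not Krill's collator.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# mlm=False -> causal language modeling: labels mirror the input ids
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

features = [tokenizer("hello world"), tokenizer("krill is tiny but mighty")]
batch = collator(features)
print(batch["input_ids"].shape, batch["labels"].shape)
```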
Krill includes a comprehensive test suite to ensure reliability and correctness:
- Unit Tests: Test individual functions in isolation with mocks
- Integration Tests: Test actual file parsing and component integration
- End-to-End Tests: Test complete workflows with real training
```bash
# Run all fast tests (unit + integration)
pytest tests/ -m "not slow"

# Run unit tests only
pytest tests/test_resume_unit.py

# Run integration tests only
pytest tests/test_resume_integration.py

# Run all tests
pytest tests/
```
See tests/README.md for detailed testing information.
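The `-m "not slow"` filter relies on pytest markers. A minimal sketch of how a slow end-to-end test might be tagged (the file name and test bodies here are illustrative, not Krill's actual tests):

```python
# tests/test_example.py -- illustrative only, not an actual Krill test file
import pytest

def test_config_defaults_fast():
    # Unit-style test: included by `pytest -m "not slow"`
    assert 1 + 1 == 2

@pytest.mark.slow
def test_full_training_e2e():
    # End-to-end test: excluded by `pytest -m "not slow"`
    ...
```

If the `slow` marker is not registered (for example in `pyproject.toml` or `pytest.ini`), pytest will emit an unknown-mark warning.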
The integration test suite includes comprehensive end-to-end tests that verify the resume functionality against real remote repositories. These tests confirm that the checkpoint step for `pretraining/krill-e2e-ci-pico` is correctly identified as 44.
- Create a YAML configuration file (see the Configuration section below):

  ```bash
  # edit path/to/config.yaml
  ```
- (Optional) Train your own tokenizer (before preprocessing):

  ```bash
  krill train-tokenizer path/to/config.yaml  # or use `kr train-tokenizer ...`
  ```
- Preprocess your dataset:

  ```bash
  krill preprocess path/to/config.yaml  # or `kr preprocess ...`
  ```
- Start model training:

  ```bash
  krill train path/to/config.yaml --num-processes 2  # or `kr train ...`
  ```
- `krill train`: Launches model training using Accelerate. Accepts extra `accelerate launch` arguments.
- `krill preprocess`: Preprocesses datasets as specified in the YAML config and displays sample data and statistics after preprocessing.
- `krill train-tokenizer`: Trains or fine-tunes a tokenizer on the configured datasets.
- Inference: Runs interactive generation from a model ID or YAML config. The `--inspect` flag is experimental and provides token-level entropy analysis for each generated output (see the sketch after this list).
- Evaluation: Not implemented yet; a placeholder for model evaluation.
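For intuition, token-level entropy here means the Shannon entropy of the model's next-token distribution at each position. The sketch below shows that computation in plain Transformers; it is not the actual `--inspect` implementation, and the model ID and prompt are placeholders.

```python
# Sketch: per-token entropy of a causal LM's next-token distributions.
# Illustrative only; not Krill's --inspect implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Krill are small crustaceans", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

log_probs = torch.log_softmax(logits, dim=-1)
# Entropy (in nats) of the predicted next-token distribution at each position
entropy = -(log_probs.exp() * log_probs).sum(dim=-1)

for token_id, h in zip(inputs["input_ids"][0], entropy[0]):
    print(f"{tokenizer.decode(int(token_id)):>12}  entropy={h.item():.2f}")
```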
Krill uses a YAML configuration file validated by Pydantic (`KrillConfig`). Example:

```yaml
datasets:
- path: pretraining/tiny-korean-100k
split: train
text_column: text
tokenizer:
hub_id: pretraining/pico-tokenizer-32k
vocab_size: 32000
preprocess:
prepared_path: ./artifacts/pico-1k
sequence_len: 1024
min_length: 150
train:
hub_model_id: pretraining/pico-1k
output_dir: ./artifacts/models/pico-1k
num_epochs: 1
learning_rate: 1e-3
weight_decay: 0.01
optimizer: muon
muon_implementation: moonlight
micro_batch_size: 2048
gradient_accumulation_steps: 1
model_config_name: pico
# Resume training from checkpoints (optional)
# Options:
# - auto: Smart detection (local checkpoint first, then remote, fallback to scratch)
# - local: Resume from local checkpoint only (error if none found)
# - remote: Resume from remote checkpoint on Hugging Face Hub
# - true: Let Hugging Face auto-detect last local checkpoint
# - false: Start from scratch
# resume: auto
```
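The `resume` comments above describe a selection order that can be pictured roughly as follows; this is a simplified sketch of the documented behavior, not Krill's actual code, and the `checkpoint-` directory naming is an assumption borrowed from Hugging Face Trainer conventions.

```python
# Rough sketch of the resume modes described above (illustrative only).
import os

def pick_resume_source(resume, output_dir: str, hub_model_id: str) -> str:
    has_local = os.path.isdir(output_dir) and any(
        name.startswith("checkpoint-") for name in os.listdir(output_dir)
    )
    if resume == "auto":
        # Local checkpoint first, then remote, otherwise start from scratch.
        if has_local:
            return "latest local checkpoint"
        return f"latest checkpoint on {hub_model_id}, or scratch if none exists"
    if resume == "local":
        if not has_local:
            raise FileNotFoundError(f"no checkpoint found under {output_dir}")
        return "latest local checkpoint"
    if resume == "remote":
        return f"latest checkpoint on {hub_model_id}"
    if resume is True:
        return "last local checkpoint auto-detected by Hugging Face"
    return "scratch"  # resume: false

print(pick_resume_source("auto", "./artifacts/models/pico-1k", "pretraining/pico-1k"))
```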
Refer to `src/krill/utils/config.py` for the full schema and defaults.
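For orientation, a trimmed-down sketch of what a Pydantic schema for a config like the one above could look like; the class and field defaults here mirror the example YAML and are not the real `KrillConfig` from `src/krill/utils/config.py`.

```python
# Trimmed-down sketch of a Pydantic schema for the YAML above (requires PyYAML).
# Not the real KrillConfig -- see src/krill/utils/config.py for the actual schema.
import yaml
from pydantic import BaseModel

class DatasetConfig(BaseModel):
    path: str
    split: str = "train"
    text_column: str = "text"

class TokenizerConfig(BaseModel):
    hub_id: str
    vocab_size: int = 32000

class PreprocessConfig(BaseModel):
    prepared_path: str
    sequence_len: int = 1024
    min_length: int = 0

class TrainConfig(BaseModel):
    hub_model_id: str
    output_dir: str
    num_epochs: int = 1
    learning_rate: float = 1e-3
    weight_decay: float = 0.0
    optimizer: str = "adamw"
    micro_batch_size: int = 1
    gradient_accumulation_steps: int = 1

class KrillConfigSketch(BaseModel):
    datasets: list[DatasetConfig]
    tokenizer: TokenizerConfig
    preprocess: PreprocessConfig
    train: TrainConfig

with open("path/to/config.yaml") as f:
    cfg = KrillConfigSketch(**yaml.safe_load(f))
print(cfg.train.learning_rate)
```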
Contributions welcome! Please open issues or pull requests on GitHub.
This project is licensed under the Apache License 2.0. See the LICENSE file for full details.