Thanks to GitHub Codespaces, you can spin up a working dev environment in your browser with a single click, no local setup required.
Set up your local environment with Poetry:

```bash
cd transfermarkt-datasets
poetry install
poetry shell
```

This creates a virtual environment at `.venv/` and installs all dependencies.
The `justfile` in the repo root defines a set of useful recipes. Some examples:

| recipe | description |
|---|---|
| `dvc_pull` | pull data from the cloud |
| `docker_build` | build the project Docker image and tag it accordingly |
| `acquire_local` | run the acquiring process locally (refreshes `data/raw/<acquirer>`) |
| `prepare_local` | run the prep process locally (refreshes `data/prep`) |
| `sync` | run the sync process (refreshes data frontends) |
| `streamlit_local` | run the Streamlit app locally |

Run `just --list` to see the full list.
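For orientation, recipes in a `justfile` are just named command lists, optionally parameterized by variables that can be overridden with `just --set`. The sketch below illustrates the general shape only; the variable names and command are assumptions, not the repo's actual recipe definitions:

```just
# Illustrative sketch of a parameterized recipe; the real definitions
# in the repo's justfile may differ.
acquirer := "transfermarkt-api"
args := ""

acquire_local:
    cd scripts/acquiring && python {{acquirer}}.py {{args}}
```

A recipe defined this way is what makes invocations like `just --set acquirer ... acquire_local` work: `--set` overrides the variable before the recipe runs.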
All project data assets are kept inside the `data` folder. This is a DVC repository, so all files can be pulled from remote storage by running:

```bash
dvc pull
```

Data is stored in Cloudflare R2 and served via a public URL, so no credentials are needed for pulling.
To push data to the remote, you need R2 credentials configured as per-remote DVC config:

```bash
dvc remote modify --local r2 access_key_id <R2_ACCESS_KEY_ID>
dvc remote modify --local r2 secret_access_key <R2_SECRET_ACCESS_KEY>
```

This stores the credentials in `.dvc/config.local`, which is gitignored.
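After running those commands, `.dvc/config.local` should contain something roughly like the following (the remote name `r2` comes from the commands above; the placeholders stay as placeholders):

```ini
['remote "r2"']
    access_key_id = <R2_ACCESS_KEY_ID>
    secret_access_key = <R2_SECRET_ACCESS_KEY>
```

Because this file is gitignored, each contributor keeps their own credentials without risk of committing them.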
| path | description |
|---|---|
| `data/raw` | Raw data from different acquirers (check data acquisition below) |
| `data/prep` | Prepared datasets as produced by dbt (check data preparation) |
"Acquiring" is the process of collecting data from a specific source via an acquiring script. Acquired data lives in the `data/raw` folder.

An acquirer is a script that collects data from somewhere and puts it in `data/raw`. Acquirers are defined in the `scripts/acquiring` folder and run using the `acquire_local` recipe. For example:
```bash
just --set acquirer transfermarkt-api --set args "--season 2024" acquire_local
```

This populates `data/raw/transfermarkt-api` with the collected data. You can also run the script directly:
```bash
cd scripts/acquiring && python transfermarkt-api.py --season 2024
```

"Preparing" is the process of transforming raw data into a high-quality dataset. This is done in SQL using dbt and DuckDB.
- `cd dbt` — the `dbt` folder contains the dbt project
- `dbt deps` — install dbt packages (required the first time)
- `dbt run -m +appearances` — refresh assets by running the corresponding model
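For context, a dbt model is a SQL file that selects from upstream sources or models; `dbt run` materializes it as a table or view. The sketch below is illustrative only — the column names and the `base_appearances` upstream model are invented, not the project's actual `appearances` model:

```sql
-- models/appearances.sql (illustrative sketch; invented names)
select
    player_id,
    game_id,
    minutes_played
from {{ ref('base_appearances') }}
where minutes_played > 0
```

The `ref()` call is how dbt wires the dependency graph, which is what makes selectors like `+appearances` (the model plus everything upstream of it) work.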
Or use the `prepare_local` recipe from the repo root:

```bash
just prepare_local
```

dbt runs populate a `dbt/duck.db` file locally. Query it with Python (no DuckDB CLI required):

```bash
python -c "import duckdb; print(duckdb.connect('dbt/duck.db').sql('SELECT * FROM dev.games LIMIT 10').fetchdf())"
```

Prepared data is published to popular dataset platforms by running `just sync`, which runs weekly as part of the data pipeline.
There is also a Streamlit app with documentation, a data catalog, and sample analysis. Run it locally with:

```bash
just streamlit_local
```

Note: the app expects prepared data to exist in `data/prep`. Run `dvc pull` or `just prepare_local` first.
All cloud infrastructure is defined as code using Terraform in the `infra` folder.

The data pipeline is orchestrated as a series of GitHub Actions workflows defined in `.github/workflows`.
| workflow name | triggers on | description |
|---|---|---|
| `build`* | Every push to `master` or an open pull request | Runs data preparation and tests, and commits a new version of the prepared data if there are changes |
| `acquire-<acquirer>.yml` | Schedule | Runs the acquirer and commits acquired data to the corresponding raw location |
| `sync-<frontend>.yml` | Every change on prepared data | Syncs prepared data to the corresponding frontend |
\* `build-contribution` is the same as `build` but without committing data.
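As a sketch of the shape these workflows take — the file name, cron schedule, and step commands below are invented for illustration, not copied from the repo:

```yaml
# .github/workflows/acquire-example.yml — illustrative sketch only
name: acquire-example
on:
  schedule:
    - cron: "0 3 * * 1"  # hypothetical weekly schedule
jobs:
  acquire:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the acquirer
        run: just --set acquirer example acquire_local
      - name: Commit acquired data
        run: |
          dvc add data/raw/example
          git commit -am "acquire: example" && git push
```

The scheduled acquire workflows and the change-triggered sync workflows both follow this pattern: run one recipe, then commit the resulting data so downstream workflows pick it up.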
Debugging workflows remotely is a pain, so use `act` to run them locally where possible.
