Thanks to GitHub Codespaces, you can spin up a working dev environment in your browser with a single click, no local setup required.
Set up your local environment with Poetry:

```bash
cd transfermarkt-datasets
poetry install
poetry shell
```

This creates a virtual environment at `.venv/` and installs all dependencies.
The `justfile` in the repo root defines a set of useful recipes. Some examples:

| recipe | description |
|---|---|
| `dvc_pull` | pull data from the cloud |
| `docker_build` | build the project Docker image and tag it accordingly |
| `acquire_local` | run the acquiring process locally (refreshes `data/raw/<acquirer>`) |
| `prepare_local` | run the prep process locally (refreshes `data/prep`) |
| `sync` | run the sync process (refreshes data frontends) |
| `streamlit_local` | run the Streamlit app locally |

Run `just --list` to see the full list.
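For orientation, recipes in a `justfile` are just named command lists, optionally parameterized by variables that can be overridden with `just --set`. The sketch below illustrates the general shape only; the variable names and command are assumptions, not the repo's actual recipe definitions:

```just
# Illustrative sketch of a parameterized recipe; the real definitions
# in the repo's justfile may differ.
acquirer := "transfermarkt-api"
args := ""

acquire_local:
    cd scripts/acquiring && python {{acquirer}}.py {{args}}
```

A recipe defined this way is what makes invocations like `just --set acquirer ... acquire_local` work: `--set` overrides the variable before the recipe runs.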
All project data assets are kept inside the `data` folder. This is a DVC repository, so all files can be pulled from remote storage by running:

```bash
dvc pull
```

Data is stored in Cloudflare R2 and served via a public URL, so no credentials are needed for pulling.
To push data to the remote, you need R2 credentials configured as per-remote DVC config:

```bash
dvc remote modify --local r2 access_key_id <R2_ACCESS_KEY_ID>
dvc remote modify --local r2 secret_access_key <R2_SECRET_ACCESS_KEY>
```

This stores the credentials in `.dvc/config.local`, which is gitignored.
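After running those commands, `.dvc/config.local` should contain something roughly like the following (the remote name `r2` comes from the commands above; the placeholders stay as placeholders):

```ini
['remote "r2"']
    access_key_id = <R2_ACCESS_KEY_ID>
    secret_access_key = <R2_SECRET_ACCESS_KEY>
```

Because this file is gitignored, each contributor keeps their own credentials without risk of committing them.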
| path | description |
|---|---|
| `data/raw` | Raw data from different acquirers (check data acquisition below) |
| `data/prep` | Prepared datasets as produced by dbt (check data preparation) |
"Acquiring" is the process of collecting data from a specific source via an acquiring script. Acquired data lives in the `data/raw` folder.

An acquirer is a script that collects data from somewhere and puts it in `data/raw`. Acquirers are defined in the `scripts/acquiring` folder and run using the `acquire_local` recipe. For example:
```bash
just --set acquirer transfermarkt-api --set args "--season 2024" acquire_local
```

This populates `data/raw/transfermarkt-api` with the collected data. You can also run the script directly:
```bash
cd scripts/acquiring && python transfermarkt-api.py --season 2024
```

"Preparing" is the process of transforming raw data into a high-quality dataset. This is done in SQL using dbt and DuckDB.
- `cd dbt` — the `dbt` folder contains the dbt project
- `dbt deps` — install dbt packages (required the first time)
- `dbt run -m +appearances` — refresh assets by running the corresponding model
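For context, a dbt model is a SQL file that selects from upstream sources or models; `dbt run` materializes it as a table or view. The sketch below is illustrative only — the column names and the `base_appearances` upstream model are invented, not the project's actual `appearances` model:

```sql
-- models/appearances.sql (illustrative sketch; invented names)
select
    player_id,
    game_id,
    minutes_played
from {{ ref('base_appearances') }}
where minutes_played > 0
```

The `ref()` call is how dbt wires the dependency graph, which is what makes selectors like `+appearances` (the model plus everything upstream of it) work.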
Or use the `prepare_local` recipe from the repo root:

```bash
just prepare_local
```

dbt runs populate a `dbt/duck.db` file locally. Query it with Python (no DuckDB CLI required):

```bash
python -c "import duckdb; print(duckdb.connect('dbt/duck.db').sql('SELECT * FROM dev.games LIMIT 10').fetchdf())"
```

Prepared data is published to popular dataset platforms by running `just sync`, which runs weekly as part of the data pipeline.
There is also a Streamlit app with documentation, a data catalog, and sample analysis. Run it locally with:

```bash
just streamlit_local
```

Note: the app expects prepared data to exist in `data/prep`. Run `dvc pull` or `just prepare_local` first.
All cloud infrastructure is defined as code using Terraform in the `infra` folder.

The data pipeline is orchestrated as a series of GitHub Actions workflows defined in `.github/workflows`.
| workflow name | triggers on | description |
|---|---|---|
| `build`* | Every push to `master` or an open pull request | Runs data preparation and tests, and commits a new version of the prepared data if there are changes |
| `acquire-<acquirer>.yml` | Schedule | Runs the acquirer and commits acquired data to the corresponding raw location |
| `sync-<frontend>.yml` | Every change on prepared data | Syncs prepared data to the corresponding frontend |
\* `build-contribution` is the same as `build` but without committing data.
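As a sketch of the shape these workflows take — the file name, cron schedule, and step commands below are invented for illustration, not copied from the repo:

```yaml
# .github/workflows/acquire-example.yml — illustrative sketch only
name: acquire-example
on:
  schedule:
    - cron: "0 3 * * 1"  # hypothetical weekly schedule
jobs:
  acquire:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the acquirer
        run: just --set acquirer example acquire_local
      - name: Commit acquired data
        run: |
          dvc add data/raw/example
          git commit -am "acquire: example" && git push
```

The scheduled acquire workflows and the change-triggered sync workflows both follow this pattern: run one recipe, then commit the resulting data so downstream workflows pick it up.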
Debugging workflows remotely is a pain, so use `act` to run them locally where possible.
