A GitHub Actions workflow that automatically tracks genome assemblies by cross-referencing eukaryote taxon data with the Earth BioGenome Project (EBP).
See the TSV file as a data table here
This repository contains an automated pipeline that:
- Monitors genome assemblies from the Earth BioGenome Project (EBP)
- Cross-references eukaryote assemblies with specific project accessions
- Creates a tracking matrix with new assemblies
The workflow runs automatically every day at midnight UTC and performs the following steps:
- Retrieve Eukaryote Assemblies: Gets all genome assemblies under eukaryote taxon ID 2759
- Retrieve Project Assemblies: Gets all assemblies under the EBP project accession PRJNA533106
- Cross-reference: Finds common assemblies between eukaryotes and the project
- Get Metadata: Retrieves detailed metadata for the cross-referenced assemblies
- Create Matrix: Writes the tracking matrix as a TSV file to the path set in MATRIX_PATH
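The cross-reference step above amounts to a set intersection on assembly accessions. The following is a minimal sketch of that idea, not the workflow's actual code: it assumes each retrieval step yields a TSV with a header row containing an `accession` column (that field name appears in TSV_FIELDS). The function names `read_accessions` and `cross_reference` and the sample accessions are hypothetical.

```python
import csv
import io


def read_accessions(tsv_text, column="accession"):
    """Return the set of values in `column` from a TSV string with a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[column] for row in reader if row.get(column)}


def cross_reference(eukaryote_tsv, project_tsv):
    """Return the sorted accessions that appear in both TSV inputs."""
    return sorted(read_accessions(eukaryote_tsv) & read_accessions(project_tsv))


# Made-up accessions, for illustration only.
eukaryotes = (
    "accession\torganism-name\n"
    "GCA_000000001.1\tSpecies a\n"
    "GCA_000000002.1\tSpecies b\n"
)
project = (
    "accession\torganism-name\n"
    "GCA_000000002.1\tSpecies b\n"
    "GCA_000000003.1\tSpecies c\n"
)

common = cross_reference(eukaryotes, project)
print(common)  # ['GCA_000000002.1']
```

Using sets keeps the comparison independent of row order in either input, and sorting the result gives a stable order for the metadata step that follows.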
schedule:
- cron: '0 0 * * *' # Runs every day at midnight UTC
The workflow uses environment variables loaded from a .env file:
- MATRIX_PATH: Path to the output matrix file
- TSV_FIELDS: Fields to extract from NCBI datasets
- PROJECT_ACCESSION: NCBI project accession to track
- DATASET_EXTRA_ARGS: Additional arguments for NCBI datasets queries
PROJECT_ACCESSION=PRJNA533106
ROOT_TAXON=2759
TSV_FIELDS=organism-name,organism-tax-id,accession,assminfo-name,assminfo-release-date,assminfo-biosample-accession,assminfo-bioproject,assminfo-bioproject-lineage-parent-accessions,source_database
MATRIX_PATH=./data/ebp-eukaryotes.tsv
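To illustrate how a script might consume these settings, here is a minimal sketch that parses KEY=VALUE lines and splits TSV_FIELDS into a list. It is an assumption about usage, not the repository's loader: `parse_env` is a hypothetical helper, and the TSV_FIELDS value is shortened for readability.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip()
    return config


# Same keys as the example above; TSV_FIELDS is shortened here.
example = """\
PROJECT_ACCESSION=PRJNA533106
ROOT_TAXON=2759
TSV_FIELDS=organism-name,organism-tax-id,accession
MATRIX_PATH=./data/ebp-eukaryotes.tsv
"""

config = parse_env(example)
fields = config["TSV_FIELDS"].split(",")
print(config["PROJECT_ACCESSION"], fields)
```

All values are read as strings, so a numeric setting such as ROOT_TAXON would need an explicit `int(...)` conversion where a number is required.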
Use the included test script to validate the pipeline locally:
chmod +x test_pipeline.sh
./test_pipeline.sh