Alambic

Alambic is a microservices-based platform for biological metadata processing. It combines natural language processing (NLP) of biomedical text with metadata retrieval from NCBI databases. A single API gateway fronts two core services: an NLP service for entity recognition in biomedical text and a metadata fetch service that retrieves structured data from biological databases.

Overview

Alambic combines two services into a unified platform:

Fetch service (AlambicDump) - Retrieves metadata from NCBI databases (SRA, BioSample, etc.) using the NCBI E-utilities API
NLP service (AlambicNLP) - Processes biomedical text to identify named entities such as genes, diseases, species, cell lines, variants, and chemicals

These services are exposed through a REST API gateway, making it easy to integrate biological metadata processing into your research or application workflows.

Architecture

The system follows a microservices architecture with three main components:

Dump Service: Handles metadata retrieval from NCBI databases
NLP Service: Processes text for entity recognition
Nginx Gateway: Routes requests to appropriate services and provides unified API access

All components are containerized using Docker for ease of deployment and scalability.
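
The gateway's routing role can be illustrated with a small sketch. It is not taken from nginx/nginx.conf: the upstream host names `dump` and `nlp` are assumptions, and only the ports and path prefixes come from this README. The gateway-level /health endpoint is not modeled here.

```python
from typing import Optional

# Path prefix -> upstream service. Ports are the ones listed under Services;
# the host names "dump" and "nlp" are hypothetical container names.
ROUTES = {
    "/api/v1/dump/": "http://dump:8001",
    "/api/v1/nlp/": "http://nlp:8000",
}


def resolve_upstream(path: str) -> Optional[str]:
    """Return the upstream that would serve `path`, or None if no prefix matches."""
    for prefix, upstream in ROUTES.items():
        if path.startswith(prefix):
            return upstream
    return None
```

For example, `resolve_upstream("/api/v1/nlp/process")` selects the NLP upstream, while an unknown path yields `None`.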

Services

AlambicDump (Port 8001)

The metadata fetching service provides functionality to:

Search NCBI databases using terms
Fetch detailed metadata for specific NCBI IDs
Process and normalize the returned XML data into structured formats
Extract and organize download links for SRA data files

The service is built with FastAPI and provides both synchronous and asynchronous endpoints for efficient processing of metadata requests.

AlambicNLP (Port 8000)

The NLP service uses pretrained biomedical language models (AIONER with a Bioformer checkpoint) to:

Identify named entities in text
Categorize entities into predefined categories (Gene, Disease, Species, etc.)
Process multiple text entries in batch mode

Prerequisites

Docker and Docker Compose
At least 4 GB of RAM for running the NLP models
Internet connection for fetching metadata from NCBI

Installation

Clone the repository:

git clone https://github.com/shitohana/Alambic.git
cd Alambic

Download the AIONER pretrained models and unpack pretrained_models.zip. Move the pretrained_models folder to nlp/pretrained_models inside the repository.
Build and start the services:
```
docker-compose up -d
```
Verify that all services are running:
```
docker-compose ps
```

The API will be available at http://localhost:9090

Configuration

NLP Service Configuration

The NLP service configuration is located in nlp/instance/config.yaml:

models:
  aioner:
    path: "/app/pretrained_models/AIONER/Bioformer-softmax-AIONER.h5"
    checkpoint: "/app/pretrained_models/bioformer-cased-v1.0"
    lowercase: false
    model_type: 1

NCBI API Configuration

For higher rate limits when accessing NCBI databases, you can pass an NCBI API key in your requests to the dump service. Register for an NCBI API key at: https://www.ncbi.nlm.nih.gov/account/settings/

Usage

Fetching Metadata from NCBI

curl -X POST "http://localhost:9090/api/v1/dump/fetch" \
  -H "Content-Type: application/json" \
  -d '{
    "terms": ["SARS-CoV-2", "human"],
    "db": "sra",
    "max_results": 10
  }'

The dump service can be configured through API parameters, including:

Database selection
Rate limits
Maximum results
API key for higher rate limits
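
The same fetch request can be issued from Python using only the standard library. This is a minimal sketch: the payload fields mirror the curl example above, the helper names are illustrative, and the response is returned as decoded JSON without assuming its schema. Parameter names for rate limits and the API key are not shown in this README, so they are left out.

```python
import json
import urllib.request

GATEWAY = "http://localhost:9090"  # gateway address from the Installation section


def build_fetch_request(terms, db="sra", max_results=10):
    """Build (without sending) the POST request for /api/v1/dump/fetch."""
    payload = {"terms": list(terms), "db": db, "max_results": max_results}
    return urllib.request.Request(
        f"{GATEWAY}/api/v1/dump/fetch",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_metadata(terms, db="sra", max_results=10, timeout=60):
    """Send the request and decode the JSON body; requires the services to be running."""
    request = build_fetch_request(terms, db, max_results)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the stack running, `fetch_metadata(["SARS-CoV-2", "human"])` sends the same request as the curl command.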

Checking Record Availability

curl -X GET "http://localhost:9090/api/v1/dump/peek?term=SARS-CoV-2" \
  -H "Content-Type: application/json"
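
A Python equivalent of the peek call might look like the sketch below. Only the `term` query parameter shown in the curl example is used, the helper names are illustrative, and the response schema is not assumed. `urlencode` takes care of escaping terms that contain spaces or other reserved characters.

```python
import json
import urllib.parse
import urllib.request

GATEWAY = "http://localhost:9090"  # gateway address from the Installation section


def build_peek_url(term):
    """Return the peek URL with the search term safely URL-encoded."""
    query = urllib.parse.urlencode({"term": term})
    return f"{GATEWAY}/api/v1/dump/peek?{query}"


def peek(term, timeout=30):
    """Call the peek endpoint and decode the JSON body; requires the services to be running."""
    with urllib.request.urlopen(build_peek_url(term), timeout=timeout) as response:
        return json.load(response)
```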

Processing Text with NLP

curl -X POST "http://localhost:9090/api/v1/nlp/process" \
  -H "Content-Type: application/json" \
  -d '{
    "entries": [
      {
        "id": "sample1",
        "text": "PRMT5 deficiency enforces the transcriptional and epigenetic programs of Klrg1+CD8+ terminal effector T cells"
      }
    ],
    "model_type": "aioner"
  }'
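
For batch processing, the request body can be assembled from a mapping of identifiers to texts. The sketch below reuses the `entries` and `model_type` fields from the curl example. The helper names are illustrative, and the generous default timeout is an assumption about model latency, not a documented value.

```python
import json
import urllib.request

GATEWAY = "http://localhost:9090"  # gateway address from the Installation section


def build_nlp_payload(texts, model_type="aioner"):
    """Turn a mapping {id: text} into the request body used by /api/v1/nlp/process."""
    entries = [{"id": entry_id, "text": text} for entry_id, text in texts.items()]
    return {"entries": entries, "model_type": model_type}


def process_texts(texts, model_type="aioner", timeout=300):
    """POST the batch and decode the JSON body; requires the services to be running."""
    request = urllib.request.Request(
        f"{GATEWAY}/api/v1/nlp/process",
        data=json.dumps(build_nlp_payload(texts, model_type)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```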

API Documentation

Interactive API documentation is available at:

AlambicDump API: http://localhost:9090/api/v1/dump/docs
AlambicNLP API: http://localhost:9090/api/v1/nlp/docs

Key Endpoints

AlambicDump Service

POST /api/v1/dump/fetch - Fetch metadata from NCBI databases
GET /api/v1/dump/peek - Check record availability in NCBI databases
GET /api/v1/dump/health - Check the health status of the dump service

AlambicNLP Service

POST /api/v1/nlp/process - Process text to identify named entities
GET /api/v1/nlp/health - Check the health status of the NLP service

Health Check

GET /health - Check the health status of all services
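
After starting the stack, the three health endpoints listed above can be probed with a short script. This sketch treats any HTTP 200 response as healthy, because the response body format is not documented here. The helper names are illustrative.

```python
import urllib.request

GATEWAY = "http://localhost:9090"  # gateway address from the Installation section
HEALTH_PATHS = ("/health", "/api/v1/dump/health", "/api/v1/nlp/health")


def health_urls(gateway=GATEWAY):
    """Return the full URL of every health endpoint listed in this README."""
    return [gateway.rstrip("/") + path for path in HEALTH_PATHS]


def is_healthy(url, timeout=5):
    """Return True on HTTP 200, False on any connection or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError and HTTPError are both subclasses of OSError
        return False


if __name__ == "__main__":
    for url in health_urls():
        print(url, "ok" if is_healthy(url) else "unreachable")
```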

Gateway

To adjust routing or rate limiting, or to add services, edit nginx/nginx.conf.

Extended Use Cases

Automated Metadata Enrichment: Process research abstracts to identify key biological entities, then automatically fetch related metadata from NCBI.
Dataset Building: Construct curated datasets by searching for specific biological terms and collecting their associated metadata.
Integration with Analysis Pipelines: Use as a component in bioinformatics workflows to augment raw data with contextual information.
Metadata Standardization: Extract entities from free-text descriptions and connect them to standard database identifiers.
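
As an illustration of the enrichment use case, the sketch below turns entity names into a payload for the fetch endpoint. It assumes the caller has already extracted the entity strings from the NLP response, whose schema is not documented here. The function name is hypothetical, and the payload fields are the ones shown in the fetch example under Usage.

```python
def entities_to_fetch_payload(entity_names, db="sra", max_results=10):
    """Build a fetch payload from entity names, dropping blanks and case-insensitive duplicates."""
    seen = set()
    terms = []
    for name in entity_names:
        cleaned = name.strip()
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            terms.append(cleaned)  # keep the first spelling encountered
    return {"terms": terms, "db": db, "max_results": max_results}
```

How the dump service combines multiple terms in a search is not specified in this README, so it may be safer to send closely related entities together and unrelated ones in separate requests.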

License

This project is licensed under the MIT License - see the LICENSE file for details.

Parts of this project are based on AIONER, which is licensed under its own terms.

Acknowledgements

NCBI E-utilities for providing the API to access biological databases
Bioformer and AIONER for the pre-trained models used in entity recognition
