Alambic is a microservices-based platform for biological metadata processing that combines natural language processing (NLP) with metadata fetching from NCBI databases. A unified API gateway fronts two core services: an NLP service for entity recognition in biomedical text and a metadata fetch service that retrieves structured data from biological databases.
- Overview
- Architecture
- Services
- Prerequisites
- Installation
- Configuration
- Usage
- API Documentation
- License
Alambic combines two powerful services into a unified platform:
- Fetch service - Retrieves metadata from NCBI databases (SRA, BioSample, etc.) using the NCBI EUtils API
- NLP - Processes biomedical text to identify named entities such as genes, diseases, species, cell lines, variants, and chemicals
These services are exposed through a REST API gateway, making it easy to integrate biological metadata processing into your research or application workflows.
The system follows a microservices architecture with three main components:
- Dump Service (metadata fetch): Handles metadata retrieval from NCBI databases
- NLP Service: Processes text for entity recognition
- Nginx Gateway: Routes requests to appropriate services and provides unified API access
All components are containerized using Docker for ease of deployment and scalability.
The metadata fetching service provides functionality to:
- Search NCBI databases using terms
- Fetch detailed metadata for specific NCBI IDs
- Process and normalize the returned XML data into structured formats
- Extract and organize download links for SRA data files
The service is built with FastAPI and provides both synchronous and asynchronous endpoints for efficient processing of metadata requests.
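To make the normalization step concrete, here is a minimal sketch of flattening NCBI-style XML into plain dictionaries. It is not the service's actual code: `normalize_biosamples` is a hypothetical helper, and the XML fragment is a simplified, hand-written sample in the style of BioSample records, with made-up values.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written fragment in the style of NCBI BioSample XML.
# Real responses carry many more elements; these values are illustrative only.
SAMPLE_XML = """
<BioSampleSet>
  <BioSample accession="SAMN00000001">
    <Description><Title>Example sample</Title></Description>
    <Attributes>
      <Attribute attribute_name="host">Homo sapiens</Attribute>
      <Attribute attribute_name="isolation_source">nasal swab</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>
"""


def normalize_biosamples(xml_text):
    """Flatten each <BioSample> element into a plain dict (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for sample in root.iter("BioSample"):
        record = {
            "accession": sample.get("accession"),
            "title": sample.findtext("Description/Title"),
        }
        # Each <Attribute> becomes one key/value pair on the record.
        for attr in sample.iter("Attribute"):
            record[attr.get("attribute_name")] = (attr.text or "").strip()
        records.append(record)
    return records


records = normalize_biosamples(SAMPLE_XML)
print(records)
```

Keeping one flat dictionary per record makes the result easy to serialize as JSON or load into a table, which is one plausible reading of "structured formats" here.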
The NLP service uses pre-trained biomedical language models (AIONER with a Bioformer checkpoint) to:
- Identify named entities in text
- Categorize entities into predefined categories (Gene, Disease, Species, etc.)
- Process multiple text entries in batch mode
- Docker and Docker Compose
- At least 4 GB of RAM for running the NLP models
- Internet connection for fetching metadata from NCBI
- Clone the repository:
    git clone https://github.com/shitohana/Alambic.git
    cd ezmetaserver
- Download the AIONER pretrained models and unpack pretrained_models.zip. Move the pretrained_models folder to ezmetaserver/nlp/pretrained_models.
- Build and start the services:
    docker-compose up -d
- Verify that all services are running:
    docker-compose ps
The API will be available at http://localhost:9090
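Containers can report as running before the services inside them accept requests, so it can help to poll the gateway's `GET /health` endpoint after startup. The following is a minimal sketch using only the standard library; `gateway_is_healthy` and `wait_until_healthy` are hypothetical helper names, and treating HTTP 200 as "healthy" is an assumption about the endpoint's behavior.

```python
import time
import urllib.error
import urllib.request

GATEWAY_URL = "http://localhost:9090"  # gateway address from this README


def gateway_is_healthy(base_url=GATEWAY_URL, timeout=5.0):
    """Return True if GET /health answers with HTTP 200 (assumed success signal)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: not ready yet.
        return False


def wait_until_healthy(check=gateway_is_healthy, attempts=10, delay=3.0):
    """Call `check` until it returns True or `attempts` is exhausted."""
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            time.sleep(delay)
    return False
```

Calling `wait_until_healthy()` right after `docker-compose up -d` returns `True` once the gateway responds, or `False` after roughly `attempts * delay` seconds. The `check` parameter is injectable so the retry loop can be exercised without a running stack.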
The NLP service configuration is located in nlp/instance/config.yaml:
    models:
      aioner:
        path: "/app/pretrained_models/AIONER/Bioformer-softmax-AIONER.h5"
        checkpoint: "/app/pretrained_models/bioformer-cased-v1.0"
        lowercase: false
        model_type: 1
For higher rate limits when accessing NCBI databases, you can provide an API key through the API requests. Register for an NCBI API key at: https://www.ncbi.nlm.nih.gov/account/settings/
curl -X POST "http://localhost:9090/api/v1/dump/fetch" \
  -H "Content-Type: application/json" \
  -d '{
    "terms": ["SARS-CoV-2", "human"],
    "db": "sra",
    "max_results": 10
  }'
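The same fetch request can be issued from Python. This is a minimal sketch using only the standard library: `build_fetch_request` and `send_json` are illustrative helper names, not part of Alambic, and decoding the response as JSON is an assumption (this README does not describe the response format).

```python
import json
import urllib.request

BASE_URL = "http://localhost:9090"  # gateway address from this README


def build_fetch_request(terms, db="sra", max_results=10, base_url=BASE_URL):
    """Build (but do not send) the POST request for /api/v1/dump/fetch."""
    body = json.dumps({"terms": list(terms), "db": db, "max_results": max_results})
    return urllib.request.Request(
        f"{base_url}/api/v1/dump/fetch",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_json(request, timeout=60.0):
    """Send a prepared request and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_fetch_request(["SARS-CoV-2", "human"])
print(request.get_method(), request.full_url)
```

With the stack running, `send_json(request)` performs the call. Separating construction from sending keeps the payload logic testable offline.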
The dump service can be configured through API parameters, including:
- Database selection
- Rate limits
- Maximum results
- API key for higher rate limits
curl -X GET "http://localhost:9090/api/v1/dump/peek?term=SARS-CoV-2" \
-H "Content-Type: application/json"
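Search terms that contain spaces or other reserved characters need to be URL-encoded before they go into the `term` query parameter. A small sketch of building the peek URL safely; `build_peek_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9090"  # gateway address from this README


def build_peek_url(term, base_url=BASE_URL):
    """Return the /api/v1/dump/peek URL with `term` URL-encoded."""
    return f"{base_url}/api/v1/dump/peek?{urlencode({'term': term})}"


print(build_peek_url("SARS-CoV-2"))
print(build_peek_url("Homo sapiens"))
```

`urlencode` leaves `SARS-CoV-2` unchanged and encodes the space in `Homo sapiens` as `+`, so both terms arrive at the service intact.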
curl -X POST "http://localhost:9090/api/v1/nlp/process" \
  -H "Content-Type: application/json" \
  -d '{
    "entries": [
      {
        "id": "sample1",
        "text": "PRMT5 deficiency enforces the transcriptional and epigenetic programs of Klrg1+CD8+ terminal effector T cells"
      }
    ],
    "model_type": "aioner"
  }'
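Because `entries` is a list, several texts can be sent in one request. The sketch below builds such a batch payload from a list of strings. `build_nlp_payload` is a hypothetical helper, and the `sample<N>` ID scheme simply mirrors the example request above; any unique string IDs should presumably work.

```python
import json


def build_nlp_payload(texts, model_type="aioner", id_prefix="sample"):
    """Build a batch payload for POST /api/v1/nlp/process (hypothetical helper).

    Blank texts are skipped. IDs are derived from the 1-based input position,
    so they stay aligned with the original list even when entries are dropped.
    """
    entries = [
        {"id": f"{id_prefix}{position}", "text": text.strip()}
        for position, text in enumerate(texts, start=1)
        if text.strip()
    ]
    return {"entries": entries, "model_type": model_type}


payload = build_nlp_payload([
    "PRMT5 deficiency enforces the transcriptional and epigenetic programs "
    "of Klrg1+CD8+ terminal effector T cells",
    "   ",  # blank entry, skipped
    "SARS-CoV-2 infection in human airway epithelial cells",
])
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be serialized with `json.dumps` and posted with `Content-Type: application/json`, exactly like the single-entry request.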
Interactive API documentation is available at:
- AlambicDump API: http://localhost:9090/api/v1/dump/docs
- AlambicNLP API: http://localhost:9090/api/v1/nlp/docs
- POST /api/v1/dump/fetch - Fetch metadata from NCBI databases
- GET /api/v1/dump/peek - Check record availability in NCBI databases
- GET /api/v1/dump/health - Check the health status of the dump service
- POST /api/v1/nlp/process - Process text to identify named entities
- GET /api/v1/nlp/health - Check the health status of the NLP service
- GET /health - Check the health status of all services
Modify nginx/nginx.conf to adjust routing or rate limiting, or to add services.
- Automated Metadata Enrichment: Process research abstracts to identify key biological entities, then automatically fetch related metadata from NCBI.
- Dataset Building: Construct curated datasets by searching for specific biological terms and collecting their associated metadata.
- Integration with Analysis Pipelines: Use as a component in bioinformatics workflows to augment raw data with contextual information.
- Metadata Standardization: Extract entities from free-text descriptions and connect them to standard database identifiers.
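The enrichment use case chains the two services: entities recognized by the NLP service become search terms for the fetch service. The sketch below shows only the glue step. The NLP response schema is not documented in this README, so the input here is assumed to be a list of `(text, category)` pairs that the caller has already extracted from the response; `terms_from_entities` and `fetch_payload` are hypothetical helpers, and the entity list is hand-written.

```python
def terms_from_entities(entities, categories=("Gene", "Species")):
    """Pick unique entity texts from the wanted categories, preserving order.

    `entities` is an assumed shape: an iterable of (text, category) pairs.
    Duplicates are detected case-insensitively; the first spelling is kept.
    """
    seen = set()
    terms = []
    for text, category in entities:
        cleaned = text.strip()
        key = cleaned.lower()
        if category in categories and key and key not in seen:
            seen.add(key)
            terms.append(cleaned)
    return terms


def fetch_payload(terms, db="sra", max_results=10):
    """Build the JSON body for POST /api/v1/dump/fetch."""
    return {"terms": terms, "db": db, "max_results": max_results}


# Hand-written entity list standing in for parsed NLP output.
entities = [
    ("PRMT5", "Gene"),
    ("prmt5", "Gene"),       # duplicate in different case, dropped
    ("human", "Species"),
    ("COVID-19", "Disease"),  # category not requested, dropped
]
payload = fetch_payload(terms_from_entities(entities))
print(payload)
```

Filtering by category before fetching keeps the search focused. Which categories make useful search terms depends on the target database, so the default here is only a starting point.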
This project is licensed under the MIT License - see the LICENSE file for details.
Parts of this project are based on AIONER, which is licensed under its own terms.
- NCBI E-utilities for providing the API to access biological databases
- Bioformer and AIONER for the pre-trained models used in entity recognition