A Python project that extracts entities and relationships from documents using Google's LangExtract library and stores them in a Dgraph graph database.
- 🔍 Entity Extraction: Extract structured entities from unstructured text using large language models
- 🕸️ Relationship Mapping: Identify and map relationships between extracted entities
- 📊 Graph Storage: Store entities and relationships in a Dgraph graph database
- 🔗 Flexible Models: Support for multiple LLM providers (Gemini, OpenAI, Ollama)
- 🚀 Batch Processing: Process multiple documents efficiently
- 📝 CLI Interface: Easy-to-use command-line interface
- 🎯 Configurable: Customizable entity types, relationship types, and extraction prompts
- Python 3.10+
- uv package manager
- Running Dgraph instance
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone the repository
git clone https://github.com/yourusername/langextract-graph.git
cd langextract-graph
# Create virtual environment and install dependencies
uv sync
# Install in development mode
uv pip install -e .
The easiest way to run Dgraph is using Docker:
# Start Dgraph
docker run -it -p 8080:8080 -p 9080:9080 -p 8000:8000 dgraph/standalone:latest
Or follow the official Dgraph installation guide.
Create a `.env` file in the project root:
# API Keys (choose based on your preferred model)
GOOGLE_API_KEY=your_gemini_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
# Dgraph Configuration
DGRAPH_HOST=localhost
DGRAPH_PORT=9080
DGRAPH_ZERO_PORT=5080
# Extraction Configuration
LANGEXTRACT_MODEL=gemini-2.5-flash
LANGEXTRACT_PROMPT="Extract entities, their properties, and relationships from the text"
# Logging
LOG_LEVEL=INFO
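These values are read at startup. A minimal sketch of the usual python-dotenv pattern, assuming the variable names above (the project's config.py may load them differently):

```python
# Sketch: load .env and expose the settings used elsewhere in this README.
# Assumes the python-dotenv package; variable names match the .env above.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root / current directory

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
DGRAPH_HOST = os.getenv("DGRAPH_HOST", "localhost")
DGRAPH_PORT = int(os.getenv("DGRAPH_PORT", "9080"))
MODEL_ID = os.getenv("LANGEXTRACT_MODEL", "gemini-2.5-flash")
```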
# Extract entities from direct text input
uv run langextract-graph extract --text "Apple Inc. was founded by Steve Jobs in Cupertino, California."
# Extract from a file
uv run langextract-graph extract --file document.txt
# Use custom entity types and relationships
uv run langextract-graph extract \
--file document.txt \
--entity-types "PERSON,COMPANY,LOCATION" \
--relationship-types "FOUNDED_BY,LOCATED_IN"
# Process all .txt files in a directory
uv run langextract-graph batch --directory ./documents --pattern "*.txt"
# Save results to JSON file
uv run langextract-graph batch \
--directory ./documents \
--output results.json
# View all entities
uv run langextract-graph query
# Filter by entity type
uv run langextract-graph query --entity-type PERSON
# View statistics
uv run langextract-graph stats
from langextract_graph import ExtractionPipeline, ExtractionConfig, DgraphConfig

# Configure extraction
extraction_config = ExtractionConfig(
    model_id="gemini-2.5-flash",
    prompt_description="Extract people, organizations, and their relationships",
    entity_types=["PERSON", "ORGANIZATION", "LOCATION"],
    relationship_types=["WORKS_FOR", "LOCATED_IN", "FOUNDED_BY"],
)

# Configure Dgraph
dgraph_config = DgraphConfig(
    connection_string="dgraph://localhost:9080"
)

# Create the pipeline and process a document
with ExtractionPipeline(extraction_config, dgraph_config) as pipeline:
    result = pipeline.process_text(
        text="Microsoft was founded by Bill Gates and Paul Allen in 1975.",
        document_id="doc_1",
    )
    print(f"Extracted {result['extracted_entities']} entities")
    print(f"Extracted {result['extracted_relationships']} relationships")
from langextract_graph.models import ExtractionConfig

# Custom extraction with few-shot examples
config = ExtractionConfig(
    model_id="gemini-2.5-flash",
    prompt_description="Extract scientific entities and relationships",
    entity_types=["RESEARCHER", "INSTITUTION", "PUBLICATION", "CONCEPT"],
    relationship_types=["AUTHORED", "AFFILIATED_WITH", "CITES", "STUDIES"],
    examples=[
        {
            "text": "Dr. Smith from MIT published a paper on quantum computing.",
            "entities": [
                {"name": "Dr. Smith", "type": "RESEARCHER"},
                {"name": "MIT", "type": "INSTITUTION"},
                {"name": "quantum computing", "type": "CONCEPT"},
            ],
            "relationships": [
                {"source": "Dr. Smith", "target": "MIT", "type": "AFFILIATED_WITH"},
                {"source": "Dr. Smith", "target": "quantum computing", "type": "STUDIES"},
            ],
        }
    ],
    temperature=0.1,
    max_tokens=2000,
)
langextract-graph/
├── src/langextract_graph/
│ ├── __init__.py # Package initialization
│ ├── models.py # Pydantic data models
│ ├── extractor.py # Entity extraction logic
│ ├── dgraph_client.py # Dgraph database client
│ ├── pipeline.py # Main processing pipeline
│ ├── cli.py # Command-line interface
│ └── config.py # Configuration management
├── examples/ # Example documents and scripts
├── tests/ # Unit tests
├── docs/ # Additional documentation
├── pyproject.toml # Project configuration
├── README.md # This file
└── .env # Environment variables (create this)
Google Gemini:
- `gemini-2.5-flash` (recommended)
- `gemini-pro`
- `gemini-pro-vision`

Requires the `GOOGLE_API_KEY` environment variable.

OpenAI:
- `gpt-4`
- `gpt-4-turbo`
- `gpt-3.5-turbo`

Requires the `OPENAI_API_KEY` environment variable.

Ollama (local):
- Any model available through Ollama
- No API key required
- Examples: `llama2`, `codellama`, `mistral`
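Ollama models plug into the same configuration as the hosted providers. A minimal sketch, assuming `ExtractionConfig` accepts an Ollama model name directly as `model_id` and that an Ollama server is running locally:

```python
from langextract_graph import ExtractionConfig

# Sketch: use a locally pulled Ollama model instead of a hosted one.
# Assumes model_id accepts Ollama model names and that Ollama is serving
# on its default port (11434); no API key is needed.
config = ExtractionConfig(
    model_id="llama2",
    prompt_description="Extract entities, their properties, and relationships from the text",
)
```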
The project automatically sets up the following Dgraph schema:
type Entity {
entity.name: string
entity.type: string
entity.properties: string
entity.source_text: string
entity.confidence: float
entity.start_position: int
entity.end_position: int
}
type Document {
document.id: string
document.title: string
document.entities: [Entity]
document.metadata: string
document.extraction_timestamp: datetime
document.model_used: string
}
# Dynamic relationship predicates based on extraction
related_to: [uid] @reverse .
works_for: [uid] @reverse .
located_in: [uid] @reverse .
# ... additional predicates created automatically
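You can also query this schema directly with pydgraph, outside the CLI. A minimal read-only sketch, assuming the predicate names above and that the project's auto-schema adds the index needed for `eq()` lookups on `entity.type`:

```python
import json
import pydgraph

# Connect to the Dgraph Alpha node (gRPC, port 9080 by default).
stub = pydgraph.DgraphClientStub("localhost:9080")
client = pydgraph.DgraphClient(stub)

# Fetch PERSON entities via the entity.* predicates shown above.
# Note: eq() on entity.type requires an index on that predicate,
# which we assume the auto-schema sets up.
query = """
{
  people(func: eq(entity.type, "PERSON")) {
    uid
    entity.name
    entity.confidence
  }
}
"""

txn = client.txn(read_only=True)
try:
    res = txn.query(query)
    print(json.loads(res.json))
finally:
    txn.discard()
stub.close()
```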
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_id` | str | `"gemini-2.5-flash"` | LLM model to use |
| `prompt_description` | str | `"Extract entities..."` | Extraction prompt |
| `entity_types` | list | `[]` | Target entity types |
| `relationship_types` | list | `[]` | Target relationship types |
| `examples` | list | `[]` | Few-shot examples |
| `temperature` | float | `0.1` | Model temperature |
| `max_tokens` | int | `None` | Max response tokens |
| Parameter | Type | Default | Description |
|---|---|---|---|
| `connection_string` | str | `"dgraph://localhost:9080"` | Connection string |
| `alpha_host` | str | `"localhost"` | Dgraph Alpha host |
| `alpha_port` | int | `9080` | Dgraph Alpha port |
| `zero_host` | str | `"localhost"` | Dgraph Zero host |
| `zero_port` | int | `5080` | Dgraph Zero port |
| `use_tls` | bool | `False` | Enable TLS |
| `drop_all` | bool | `False` | Drop all data on init |
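As with `ExtractionConfig`, these parameters map onto the constructor. An illustrative sketch, assuming the names above are plain keyword arguments on the Pydantic model:

```python
from langextract_graph import DgraphConfig

# Illustrative: parameter names and defaults taken from the table above.
config = DgraphConfig(
    connection_string="dgraph://localhost:9080",
    alpha_host="localhost",
    alpha_port=9080,
    zero_host="localhost",
    zero_port=5080,
    use_tls=False,
    drop_all=False,  # True wipes all data on init; use with care
)
```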
Extract entities from text or a file and store them in Dgraph.
uv run langextract-graph extract [OPTIONS]
Options:
- `--text, -t`: Text to extract entities from
- `--file, -f`: File to process
- `--output, -o`: Output file for results (JSON)
- `--model, -m`: Model to use (default: gemini-2.5-flash)
- `--dgraph-host`: Dgraph Alpha host (default: localhost)
- `--dgraph-port`: Dgraph Alpha port (default: 9080)
- `--entity-types`: Comma-separated entity types
- `--relationship-types`: Comma-separated relationship types
- `--prompt`: Custom extraction prompt
- `--dry-run`: Extract without storing in Dgraph
Process multiple documents from a directory.
uv run langextract-graph batch [OPTIONS]
Options:
- `--directory, -d`: Directory containing documents (required)
- `--pattern, -p`: File pattern to match (default: *.txt)
- `--output, -o`: Output file for batch results (JSON)
- `--model, -m`: Model to use
- `--entity-types`: Target entity types
- `--relationship-types`: Target relationship types
Query entities from Dgraph.
uv run langextract-graph query [OPTIONS]
Options:
- `--entity-type`: Filter by entity type
- `--limit`: Maximum entities to display (default: 50)
- `--dgraph-host`: Dgraph host
- `--dgraph-port`: Dgraph port
Display statistics about extracted entities.
uv run langextract-graph stats [OPTIONS]
# Install with development dependencies
uv sync --group dev
# Install pre-commit hooks
uv run pre-commit install
# Run tests
uv run pytest
# Run tests with coverage
uv run pytest --cov=langextract_graph
# Format code
uv run black src/ tests/
uv run isort src/ tests/
# Type checking
uv run mypy src/
Defined in `pyproject.toml`:
# Run the CLI
uv run langextract-graph --help
# Run tests
uv run pytest
# Format code
uv run black .
uv run isort .
# Type checking
uv run mypy .
The project includes comprehensive error handling:
- API Errors: Graceful handling of LLM API failures
- Network Issues: Automatic retries for transient failures
- Dgraph Errors: Clear error messages for database issues
- Validation: Pydantic models ensure data integrity
- Logging: Structured logging with loguru
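In library use, these failures surface as exceptions from the pipeline. A defensive sketch (the exception handling shown is illustrative; the project's specific exception classes are not documented here):

```python
from loguru import logger
from langextract_graph import ExtractionPipeline, ExtractionConfig, DgraphConfig

config = ExtractionConfig(model_id="gemini-2.5-flash")
dgraph = DgraphConfig(connection_string="dgraph://localhost:9080")

try:
    with ExtractionPipeline(config, dgraph) as pipeline:
        result = pipeline.process_text(text="...", document_id="doc_1")
        logger.info("Extracted {} entities", result["extracted_entities"])
except Exception as exc:  # substitute the project's specific exceptions
    logger.error("Extraction failed: {}", exc)
```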
- Batch Processing: Process multiple documents efficiently
- Connection Pooling: Reuse Dgraph connections
- Memory Management: Stream large files to avoid memory issues
- Caching: Cache frequent queries (future enhancement)
- Parallel Processing: Process documents concurrently (future enhancement)
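For the memory-management point above, one workable pattern is to feed a large file through the pipeline in fixed-size chunks so the whole document never sits in memory. A sketch (the chunk size is arbitrary, and naive splitting can cut an entity in half at a boundary):

```python
from langextract_graph import ExtractionPipeline, ExtractionConfig, DgraphConfig

def iter_chunks(path: str, chunk_chars: int = 10_000):
    """Yield successive text chunks instead of reading the whole file."""
    with open(path, "r", encoding="utf-8") as fh:
        while chunk := fh.read(chunk_chars):
            yield chunk

config = ExtractionConfig(model_id="gemini-2.5-flash")
dgraph = DgraphConfig(connection_string="dgraph://localhost:9080")

# One pipeline (and one Dgraph connection) reused across all chunks.
with ExtractionPipeline(config, dgraph) as pipeline:
    for i, chunk in enumerate(iter_chunks("big_document.txt")):
        pipeline.process_text(text=chunk, document_id=f"big_document_{i}")
```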
- API Key Not Set

  export GOOGLE_API_KEY=your_key_here  # or add to .env file

- Dgraph Connection Failed

  # Check if Dgraph is running
  curl http://localhost:8080/health

  # Check ports in configuration
  uv run langextract-graph query --dgraph-host localhost --dgraph-port 9080

- Model Not Found

  # For Ollama models, ensure the model is pulled
  ollama pull llama2

- Import Errors

  # Reinstall in development mode
  uv pip install -e .
Enable verbose logging:
uv run langextract-graph --verbose extract --file document.txt
Or set in environment:
export LOG_LEVEL=DEBUG
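When using the library directly, the same effect can be had through loguru (which the project logs with):

```python
import sys
from loguru import logger

# Equivalent of LOG_LEVEL=DEBUG for programmatic use.
logger.remove()                        # drop the default handler
logger.add(sys.stderr, level="DEBUG")  # re-add it at DEBUG level
```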
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Run the tests (`uv run pytest`)
5. Format the code (`uv run black . && uv run isort .`)
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Google LangExtract for entity extraction
- Dgraph for graph database capabilities
- pydgraph for Python Dgraph client
- uv for fast Python package management
- Add support for more document formats (PDF, DOCX)
- Implement caching for extraction results
- Add graph visualization capabilities
- Support for custom extraction templates
- Real-time document processing
- Web interface for easier interaction
- Advanced query capabilities
- Export to other graph databases (Neo4j, etc.)
For questions, issues, or contributions:
- Open an issue
- Check the documentation
- Review examples
Made with ❤️ for the graph database and NLP community