
LangExtract Graph

A Python project that extracts entities and relationships from documents using Google's LangExtract library and stores them in a Dgraph graph database.

Features

  • 🔍 Entity Extraction: Extract structured entities from unstructured text using large language models
  • 🕸️ Relationship Mapping: Identify and map relationships between extracted entities
  • 📊 Graph Storage: Store entities and relationships in Dgraph graph database
  • 🔗 Flexible Models: Support for multiple LLM providers (Gemini, OpenAI, Ollama)
  • 🚀 Batch Processing: Process multiple documents efficiently
  • 📝 CLI Interface: Easy-to-use command-line interface
  • 🎯 Configurable: Customizable entity types, relationship types, and extraction prompts

Installation

Prerequisites

  • Python 3.10+
  • uv package manager
  • Running Dgraph instance

Install with uv

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/johnymontana/langextract-graph.git
cd langextract-graph

# Create virtual environment and install dependencies
uv sync

# Install in development mode
uv pip install -e .

Set up Dgraph

The easiest way to run Dgraph is with Docker:

# Start Dgraph
docker run -it -p 8080:8080 -p 9080:9080 -p 8000:8000 dgraph/standalone:latest

Or follow the official Dgraph installation guide.
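
To confirm from Python that Dgraph is reachable before running extractions, you can hit the same health endpoint used in the Troubleshooting section below. A minimal sketch, assuming the default ports from the Docker command above:

import requests

# Dgraph Alpha serves an HTTP health endpoint on port 8080
resp = requests.get("http://localhost:8080/health", timeout=5)
resp.raise_for_status()
print(resp.json())  # one health record per Alpha instance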

Environment Variables

Create a .env file in the project root:

# API Keys (choose based on your preferred model)
GOOGLE_API_KEY=your_gemini_api_key_here
OPENAI_API_KEY=your_openai_api_key_here

# Dgraph Configuration
DGRAPH_HOST=localhost
DGRAPH_PORT=9080
DGRAPH_ZERO_PORT=5080

# Extraction Configuration
LANGEXTRACT_MODEL=gemini-2.5-flash
LANGEXTRACT_PROMPT="Extract entities, their properties, and relationships from the text"

# Logging
LOG_LEVEL=INFO
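
These variables are typically loaded with python-dotenv. A minimal sketch of reading them (the variable names match the file above; the loading code itself is illustrative, not the project's exact config module):

import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

model_id = os.getenv("LANGEXTRACT_MODEL", "gemini-2.5-flash")
dgraph_host = os.getenv("DGRAPH_HOST", "localhost")
dgraph_port = int(os.getenv("DGRAPH_PORT", "9080"))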

Quick Start

1. Extract from Text

# Extract entities from direct text input
uv run langextract-graph extract --text "Apple Inc. was founded by Steve Jobs in Cupertino, California."

# Extract from a file
uv run langextract-graph extract --file document.txt

# Use custom entity types and relationships
uv run langextract-graph extract \
    --file document.txt \
    --entity-types "PERSON,COMPANY,LOCATION" \
    --relationship-types "FOUNDED_BY,LOCATED_IN"

2. Batch Process Documents

# Process all .txt files in a directory
uv run langextract-graph batch --directory ./documents --pattern "*.txt"

# Save results to JSON file
uv run langextract-graph batch \
    --directory ./documents \
    --output results.json

3. Query Extracted Data

# View all entities
uv run langextract-graph query

# Filter by entity type
uv run langextract-graph query --entity-type PERSON

# View statistics
uv run langextract-graph stats

Usage Examples

Python API

from langextract_graph import ExtractionPipeline, ExtractionConfig, DgraphConfig

# Configure extraction
extraction_config = ExtractionConfig(
    model_id="gemini-2.5-flash",
    prompt_description="Extract people, organizations, and their relationships",
    entity_types=["PERSON", "ORGANIZATION", "LOCATION"],
    relationship_types=["WORKS_FOR", "LOCATED_IN", "FOUNDED_BY"]
)

# Configure Dgraph
dgraph_config = DgraphConfig(
    connection_string="dgraph://localhost:9080"
)

# Create pipeline
with ExtractionPipeline(extraction_config, dgraph_config) as pipeline:
    # Process a document
    result = pipeline.process_text(
        text="Microsoft was founded by Bill Gates and Paul Allen in 1975.",
        document_id="doc_1"
    )
    
    print(f"Extracted {result['extracted_entities']} entities")
    print(f"Extracted {result['extracted_relationships']} relationships")

Advanced Configuration

from langextract_graph.models import ExtractionConfig

# Custom extraction with examples
config = ExtractionConfig(
    model_id="gemini-2.5-flash",
    prompt_description="Extract scientific entities and relationships",
    entity_types=["RESEARCHER", "INSTITUTION", "PUBLICATION", "CONCEPT"],
    relationship_types=["AUTHORED", "AFFILIATED_WITH", "CITES", "STUDIES"],
    examples=[
        {
            "text": "Dr. Smith from MIT published a paper on quantum computing.",
            "entities": [
                {"name": "Dr. Smith", "type": "RESEARCHER"},
                {"name": "MIT", "type": "INSTITUTION"},
                {"name": "quantum computing", "type": "CONCEPT"}
            ],
            "relationships": [
                {"source": "Dr. Smith", "target": "MIT", "type": "AFFILIATED_WITH"},
                {"source": "Dr. Smith", "target": "quantum computing", "type": "STUDIES"}
            ]
        }
    ],
    temperature=0.1,
    max_tokens=2000
)
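
The advanced config plugs into the pipeline exactly like the basic one. A short sketch, assuming DgraphConfig can be constructed with the defaults listed in the configuration tables below:

# Reuses ExtractionPipeline and DgraphConfig from the Python API example above
with ExtractionPipeline(config, DgraphConfig()) as pipeline:
    result = pipeline.process_text(
        text="Dr. Smith from MIT published a paper on quantum computing.",
        document_id="paper_1",
    )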

Project Structure

langextract-graph/
├── src/langextract_graph/
│   ├── __init__.py          # Package initialization
│   ├── models.py            # Pydantic data models
│   ├── extractor.py         # Entity extraction logic
│   ├── dgraph_client.py     # Dgraph database client
│   ├── pipeline.py          # Main processing pipeline
│   ├── cli.py               # Command-line interface
│   └── config.py            # Configuration management
├── examples/                # Example documents and scripts
├── tests/                   # Unit tests
├── docs/                    # Additional documentation
├── pyproject.toml           # Project configuration
├── README.md                # This file
└── .env                     # Environment variables (create this)

Supported Models

Gemini (Google)

  • gemini-2.5-flash (recommended)
  • gemini-pro
  • gemini-pro-vision

Requires GOOGLE_API_KEY environment variable.

OpenAI

  • gpt-4
  • gpt-4-turbo
  • gpt-3.5-turbo

Requires OPENAI_API_KEY environment variable.

Ollama (Local)

  • Any model available through Ollama
  • No API key required
  • Example: llama2, codellama, mistral

Dgraph Schema

The project automatically sets up the following Dgraph schema:

type Entity {
    entity.name: string
    entity.type: string  
    entity.properties: string
    entity.source_text: string
    entity.confidence: float
    entity.start_position: int
    entity.end_position: int
}

type Document {
    document.id: string
    document.title: string
    document.entities: [Entity]
    document.metadata: string
    document.extraction_timestamp: datetime
    document.model_used: string
}

# Dynamic relationship predicates based on extraction
related_to: [uid] @reverse .
works_for: [uid] @reverse .
located_in: [uid] @reverse .
# ... additional predicates created automatically
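
With this schema in place, extracted entities can also be queried directly with DQL through pydgraph. A minimal sketch using the predicate names above (it assumes entity.type carries an index so that eq filtering works, and uses the default connection from the tables below):

import json

import pydgraph

stub = pydgraph.DgraphClientStub("localhost:9080")
client = pydgraph.DgraphClient(stub)

query = """
{
  people(func: eq(entity.type, "PERSON"), first: 10) {
    uid
    entity.name
    entity.confidence
  }
}
"""

txn = client.txn(read_only=True)
try:
    res = txn.query(query)
    print(json.dumps(json.loads(res.json), indent=2))
finally:
    txn.discard()
stub.close()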

Configuration Options

Extraction Configuration

Parameter           Type   Default                 Description
model_id            str    "gemini-2.5-flash"      LLM model to use
prompt_description  str    "Extract entities..."   Extraction prompt
entity_types        list   []                      Target entity types
relationship_types  list   []                      Target relationship types
examples            list   []                      Few-shot examples
temperature         float  0.1                     Model temperature
max_tokens          int    None                    Max response tokens

Dgraph Configuration

Parameter          Type  Default                     Description
connection_string  str   "dgraph://localhost:9080"   Connection string
alpha_host         str   "localhost"                 Dgraph Alpha host
alpha_port         int   9080                        Dgraph Alpha port
zero_host          str   "localhost"                 Dgraph Zero host
zero_port          int   5080                        Dgraph Zero port
use_tls            bool  False                       Enable TLS
drop_all           bool  False                       Drop all data on init

CLI Commands

extract

Extract entities from text or file and store in Dgraph.

uv run langextract-graph extract [OPTIONS]

Options:

  • --text, -t: Text to extract entities from
  • --file, -f: File to process
  • --output, -o: Output file for results (JSON)
  • --model, -m: Model to use (default: gemini-2.5-flash)
  • --dgraph-host: Dgraph Alpha host (default: localhost)
  • --dgraph-port: Dgraph Alpha port (default: 9080)
  • --entity-types: Comma-separated entity types
  • --relationship-types: Comma-separated relationship types
  • --prompt: Custom extraction prompt
  • --dry-run: Extract without storing in Dgraph

batch

Process multiple documents from a directory.

uv run langextract-graph batch [OPTIONS]

Options:

  • --directory, -d: Directory containing documents (required)
  • --pattern, -p: File pattern to match (default: *.txt)
  • --output, -o: Output file for batch results (JSON)
  • --model, -m: Model to use
  • --entity-types: Target entity types
  • --relationship-types: Target relationship types

query

Query entities from Dgraph.

uv run langextract-graph query [OPTIONS]

Options:

  • --entity-type: Filter by entity type
  • --limit: Maximum entities to display (default: 50)
  • --dgraph-host: Dgraph host
  • --dgraph-port: Dgraph port

stats

Display statistics about extracted entities.

uv run langextract-graph stats [OPTIONS]

Development

Setup Development Environment

# Install with development dependencies
uv sync --group dev

# Install pre-commit hooks
uv run pre-commit install

# Run tests
uv run pytest

# Run tests with coverage
uv run pytest --cov=langextract_graph

# Format code
uv run black src/ tests/
uv run isort src/ tests/

# Type checking
uv run mypy src/

Project Scripts

Defined in pyproject.toml:

# Run the CLI
uv run langextract-graph --help

# Run tests
uv run pytest

# Format code
uv run black .
uv run isort .

# Type checking
uv run mypy .

Error Handling

The project includes comprehensive error handling:

  • API Errors: Graceful handling of LLM API failures
  • Network Issues: Automatic retries for transient failures (pattern sketched below)
  • Dgraph Errors: Clear error messages for database issues
  • Validation: Pydantic models ensure data integrity
  • Logging: Structured logging with loguru
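
The retry behavior for transient failures follows a standard exponential-backoff pattern. An illustrative sketch with tenacity, showing the pattern rather than the project's exact implementation:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
def call_llm(prompt: str) -> str:
    # Any exception raised here triggers a retry with exponential
    # backoff, giving up after three attempts.
    ...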

Performance Considerations

  • Batch Processing: Process multiple documents efficiently
  • Connection Pooling: Reuse Dgraph connections
  • Memory Management: Stream large files to avoid memory issues (see the sketch after this list)
  • Caching: Cache frequent queries (future enhancement)
  • Parallel Processing: Process documents concurrently (future enhancement)
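
For large inputs, reading and extracting in bounded chunks keeps memory flat. A minimal sketch built on the pipeline API shown earlier (the chunk size and the one-document-per-chunk strategy are illustrative):

def process_large_file(pipeline, path, chunk_chars=10_000):
    # Read incrementally instead of loading the whole file,
    # submitting one extraction per chunk.
    with open(path, encoding="utf-8") as f:
        index = 0
        while chunk := f.read(chunk_chars):
            pipeline.process_text(text=chunk, document_id=f"{path}_{index}")
            index += 1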

Troubleshooting

Common Issues

  1. API Key Not Set

    export GOOGLE_API_KEY=your_key_here
    # or add to .env file
  2. Dgraph Connection Failed

    # Check if Dgraph is running
    curl http://localhost:8080/health
    
    # Check ports in configuration
    uv run langextract-graph query --dgraph-host localhost --dgraph-port 9080
  3. Model Not Found

    # For Ollama models, ensure model is pulled
    ollama pull llama2
  4. Import Errors

    # Reinstall in development mode
    uv pip install -e .

Debugging

Enable verbose logging:

uv run langextract-graph --verbose extract --file document.txt

Or set in environment:

export LOG_LEVEL=DEBUG
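
Since logging goes through loguru (see Error Handling above), honoring LOG_LEVEL usually takes just two lines. A sketch of the pattern, not necessarily the project's exact code:

import os
import sys

from loguru import logger

logger.remove()  # drop loguru's default handler, which logs at DEBUG
logger.add(sys.stderr, level=os.getenv("LOG_LEVEL", "INFO"))
logger.debug("visible only when LOG_LEVEL=DEBUG")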

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests (uv run pytest)
  5. Format code (uv run black . && uv run isort .)
  6. Commit changes (git commit -m 'Add amazing feature')
  7. Push to branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Roadmap

  • Add support for more document formats (PDF, DOCX)
  • Implement caching for extraction results
  • Add graph visualization capabilities
  • Support for custom extraction templates
  • Real-time document processing
  • Web interface for easier interaction
  • Advanced query capabilities
  • Export to other graph databases (Neo4j, etc.)

Support

For questions, issues, or contributions, please open an issue or pull request on the GitHub repository.


Made with ❤️ for the graph database and NLP community
