
LangExtract Graph

A Python project that extracts entities and relationships from documents using Google's LangExtract library and stores them in a Dgraph graph database.

Features

  • 🔍 Entity Extraction: Extract structured entities from unstructured text using large language models
  • 🕸️ Relationship Mapping: Identify and map relationships between extracted entities
  • 📊 Graph Storage: Store entities and relationships in Dgraph graph database
  • 🔗 Flexible Models: Support for multiple LLM providers (Gemini, OpenAI, Ollama)
  • 🚀 Batch Processing: Process multiple documents efficiently
  • 📝 CLI Interface: Easy-to-use command-line interface
  • 🎯 Configurable: Customizable entity types, relationship types, and extraction prompts

Installation

Prerequisites

  • Python 3.10+
  • uv package manager
  • Running Dgraph instance

Install with uv

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/johnymontana/langextract-graph.git
cd langextract-graph

# Create virtual environment and install dependencies
uv sync

# Install in development mode
uv pip install -e .

Set up Dgraph

The easiest way to run Dgraph is with Docker:

# Start Dgraph
docker run -it -p 8080:8080 -p 9080:9080 -p 8000:8000 dgraph/standalone:latest

Or follow the official Dgraph installation guide.
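
To confirm from Python that Dgraph is reachable before running extractions, you can hit the same health endpoint used in the Troubleshooting section below. A minimal sketch, assuming the default ports from the Docker command above:

import requests

# Dgraph Alpha serves an HTTP health endpoint on port 8080
resp = requests.get("http://localhost:8080/health", timeout=5)
resp.raise_for_status()
print(resp.json())  # one health record per Alpha instance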

Environment Variables

Create a .env file in the project root:

# API Keys (choose based on your preferred model)
GOOGLE_API_KEY=your_gemini_api_key_here
OPENAI_API_KEY=your_openai_api_key_here

# Dgraph Configuration
DGRAPH_HOST=localhost
DGRAPH_PORT=9080
DGRAPH_ZERO_PORT=5080

# Extraction Configuration
LANGEXTRACT_MODEL=gemini-2.5-flash
LANGEXTRACT_PROMPT="Extract entities, their properties, and relationships from the text"

# Logging
LOG_LEVEL=INFO
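
These variables are typically loaded with python-dotenv. A minimal sketch of reading them (the variable names match the file above; the loading code itself is illustrative, not the project's exact config module):

import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

model_id = os.getenv("LANGEXTRACT_MODEL", "gemini-2.5-flash")
dgraph_host = os.getenv("DGRAPH_HOST", "localhost")
dgraph_port = int(os.getenv("DGRAPH_PORT", "9080"))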

Quick Start

1. Extract from Text

# Extract entities from direct text input
uv run langextract-graph extract --text "Apple Inc. was founded by Steve Jobs in Cupertino, California."

# Extract from a file
uv run langextract-graph extract --file document.txt

# Use custom entity types and relationships
uv run langextract-graph extract \
    --file document.txt \
    --entity-types "PERSON,COMPANY,LOCATION" \
    --relationship-types "FOUNDED_BY,LOCATED_IN"

2. Batch Process Documents

# Process all .txt files in a directory
uv run langextract-graph batch --directory ./documents --pattern "*.txt"

# Save results to JSON file
uv run langextract-graph batch \
    --directory ./documents \
    --output results.json

3. Query Extracted Data

# View all entities
uv run langextract-graph query

# Filter by entity type
uv run langextract-graph query --entity-type PERSON

# View statistics
uv run langextract-graph stats

Usage Examples

Python API

from langextract_graph import ExtractionPipeline, ExtractionConfig, DgraphConfig

# Configure extraction
extraction_config = ExtractionConfig(
    model_id="gemini-2.5-flash",
    prompt_description="Extract people, organizations, and their relationships",
    entity_types=["PERSON", "ORGANIZATION", "LOCATION"],
    relationship_types=["WORKS_FOR", "LOCATED_IN", "FOUNDED_BY"]
)

# Configure Dgraph
dgraph_config = DgraphConfig(
    connection_string="dgraph://localhost:9080"
)

# Create pipeline
with ExtractionPipeline(extraction_config, dgraph_config) as pipeline:
    # Process a document
    result = pipeline.process_text(
        text="Microsoft was founded by Bill Gates and Paul Allen in 1975.",
        document_id="doc_1"
    )
    
    print(f"Extracted {result['extracted_entities']} entities")
    print(f"Extracted {result['extracted_relationships']} relationships")

Advanced Configuration

from langextract_graph.models import ExtractionConfig

# Custom extraction with examples
config = ExtractionConfig(
    model_id="gemini-2.5-flash",
    prompt_description="Extract scientific entities and relationships",
    entity_types=["RESEARCHER", "INSTITUTION", "PUBLICATION", "CONCEPT"],
    relationship_types=["AUTHORED", "AFFILIATED_WITH", "CITES", "STUDIES"],
    examples=[
        {
            "text": "Dr. Smith from MIT published a paper on quantum computing.",
            "entities": [
                {"name": "Dr. Smith", "type": "RESEARCHER"},
                {"name": "MIT", "type": "INSTITUTION"},
                {"name": "quantum computing", "type": "CONCEPT"}
            ],
            "relationships": [
                {"source": "Dr. Smith", "target": "MIT", "type": "AFFILIATED_WITH"},
                {"source": "Dr. Smith", "target": "quantum computing", "type": "STUDIES"}
            ]
        }
    ],
    temperature=0.1,
    max_tokens=2000
)
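
The advanced config plugs into the pipeline exactly like the basic one. A short sketch, assuming DgraphConfig can be constructed with the defaults listed in the configuration tables below:

# Reuses ExtractionPipeline and DgraphConfig from the Python API example above
with ExtractionPipeline(config, DgraphConfig()) as pipeline:
    result = pipeline.process_text(
        text="Dr. Smith from MIT published a paper on quantum computing.",
        document_id="paper_1",
    )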

Project Structure

langextract-graph/
├── src/langextract_graph/
│   ├── __init__.py          # Package initialization
│   ├── models.py            # Pydantic data models
│   ├── extractor.py         # Entity extraction logic
│   ├── dgraph_client.py     # Dgraph database client
│   ├── pipeline.py          # Main processing pipeline
│   ├── cli.py               # Command-line interface
│   └── config.py            # Configuration management
├── examples/                # Example documents and scripts
├── tests/                   # Unit tests
├── docs/                    # Additional documentation
├── pyproject.toml           # Project configuration
├── README.md                # This file
└── .env                     # Environment variables (create this)

Supported Models

Gemini (Google)

  • gemini-2.5-flash (recommended)
  • gemini-pro
  • gemini-pro-vision

Requires GOOGLE_API_KEY environment variable.

OpenAI

  • gpt-4
  • gpt-4-turbo
  • gpt-3.5-turbo

Requires OPENAI_API_KEY environment variable.

Ollama (Local)

  • Any model available through Ollama
  • No API key required
  • Example: llama2, codellama, mistral

Dgraph Schema

The project automatically sets up the following Dgraph schema:

type Entity {
    entity.name: string
    entity.type: string  
    entity.properties: string
    entity.source_text: string
    entity.confidence: float
    entity.start_position: int
    entity.end_position: int
}

type Document {
    document.id: string
    document.title: string
    document.entities: [Entity]
    document.metadata: string
    document.extraction_timestamp: datetime
    document.model_used: string
}

# Dynamic relationship predicates based on extraction
related_to: [uid] @reverse .
works_for: [uid] @reverse .
located_in: [uid] @reverse .
# ... additional predicates created automatically
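
With this schema in place, extracted entities can also be queried directly with DQL through pydgraph. A minimal sketch using the predicate names above (it assumes entity.type carries an index so that eq filtering works, and uses the default connection from the tables below):

import json

import pydgraph

stub = pydgraph.DgraphClientStub("localhost:9080")
client = pydgraph.DgraphClient(stub)

query = """
{
  people(func: eq(entity.type, "PERSON"), first: 10) {
    uid
    entity.name
    entity.confidence
  }
}
"""

txn = client.txn(read_only=True)
try:
    res = txn.query(query)
    print(json.dumps(json.loads(res.json), indent=2))
finally:
    txn.discard()
stub.close()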

Configuration Options

Extraction Configuration

Parameter           Type   Default                 Description
model_id            str    "gemini-2.5-flash"      LLM model to use
prompt_description  str    "Extract entities..."   Extraction prompt
entity_types        list   []                      Target entity types
relationship_types  list   []                      Target relationship types
examples            list   []                      Few-shot examples
temperature         float  0.1                     Model temperature
max_tokens          int    None                    Max response tokens

Dgraph Configuration

Parameter          Type  Default                     Description
connection_string  str   "dgraph://localhost:9080"   Connection string
alpha_host         str   "localhost"                 Dgraph Alpha host
alpha_port         int   9080                        Dgraph Alpha port
zero_host          str   "localhost"                 Dgraph Zero host
zero_port          int   5080                        Dgraph Zero port
use_tls            bool  False                       Enable TLS
drop_all           bool  False                       Drop all data on init

CLI Commands

extract

Extract entities from text or file and store in Dgraph.

uv run langextract-graph extract [OPTIONS]

Options:

  • --text, -t: Text to extract entities from
  • --file, -f: File to process
  • --output, -o: Output file for results (JSON)
  • --model, -m: Model to use (default: gemini-2.5-flash)
  • --dgraph-host: Dgraph Alpha host (default: localhost)
  • --dgraph-port: Dgraph Alpha port (default: 9080)
  • --entity-types: Comma-separated entity types
  • --relationship-types: Comma-separated relationship types
  • --prompt: Custom extraction prompt
  • --dry-run: Extract without storing in Dgraph

batch

Process multiple documents from a directory.

uv run langextract-graph batch [OPTIONS]

Options:

  • --directory, -d: Directory containing documents (required)
  • --pattern, -p: File pattern to match (default: *.txt)
  • --output, -o: Output file for batch results (JSON)
  • --model, -m: Model to use
  • --entity-types: Target entity types
  • --relationship-types: Target relationship types

query

Query entities from Dgraph.

uv run langextract-graph query [OPTIONS]

Options:

  • --entity-type: Filter by entity type
  • --limit: Maximum entities to display (default: 50)
  • --dgraph-host: Dgraph host
  • --dgraph-port: Dgraph port

stats

Display statistics about extracted entities.

uv run langextract-graph stats [OPTIONS]

Development

Setup Development Environment

# Install with development dependencies
uv sync --group dev

# Install pre-commit hooks
uv run pre-commit install

# Run tests
uv run pytest

# Run tests with coverage
uv run pytest --cov=langextract_graph

# Format code
uv run black src/ tests/
uv run isort src/ tests/

# Type checking
uv run mypy src/

Project Scripts

Defined in pyproject.toml:

# Run the CLI
uv run langextract-graph --help

# Run tests
uv run pytest

# Format code
uv run black .
uv run isort .

# Type checking
uv run mypy .

Error Handling

The project includes comprehensive error handling:

  • API Errors: Graceful handling of LLM API failures
  • Network Issues: Automatic retries for transient failures (pattern sketched below)
  • Dgraph Errors: Clear error messages for database issues
  • Validation: Pydantic models ensure data integrity
  • Logging: Structured logging with loguru
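
The retry behavior for transient failures follows a standard exponential-backoff pattern. An illustrative sketch with tenacity, showing the pattern rather than the project's exact implementation:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
def call_llm(prompt: str) -> str:
    # Any exception raised here triggers a retry with exponential
    # backoff, giving up after three attempts.
    ...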

Performance Considerations

  • Batch Processing: Process multiple documents efficiently
  • Connection Pooling: Reuse Dgraph connections
  • Memory Management: Stream large files to avoid memory issues (see the sketch after this list)
  • Caching: Cache frequent queries (future enhancement)
  • Parallel Processing: Process documents concurrently (future enhancement)
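
For large inputs, reading and extracting in bounded chunks keeps memory flat. A minimal sketch built on the pipeline API shown earlier (the chunk size and the one-document-per-chunk strategy are illustrative):

def process_large_file(pipeline, path, chunk_chars=10_000):
    # Read incrementally instead of loading the whole file,
    # submitting one extraction per chunk.
    with open(path, encoding="utf-8") as f:
        index = 0
        while chunk := f.read(chunk_chars):
            pipeline.process_text(text=chunk, document_id=f"{path}_{index}")
            index += 1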

Troubleshooting

Common Issues

  1. API Key Not Set

    export GOOGLE_API_KEY=your_key_here
    # or add to .env file
  2. Dgraph Connection Failed

    # Check if Dgraph is running
    curl http://localhost:8080/health
    
    # Check ports in configuration
    uv run langextract-graph query --dgraph-host localhost --dgraph-port 9080
  3. Model Not Found

    # For Ollama models, ensure model is pulled
    ollama pull llama2
  4. Import Errors

    # Reinstall in development mode
    uv pip install -e .

Debugging

Enable verbose logging:

uv run langextract-graph --verbose extract --file document.txt

Or set in environment:

export LOG_LEVEL=DEBUG
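
Since logging goes through loguru (see Error Handling above), honoring LOG_LEVEL usually takes just two lines. A sketch of the pattern, not necessarily the project's exact code:

import os
import sys

from loguru import logger

logger.remove()  # drop loguru's default handler, which logs at DEBUG
logger.add(sys.stderr, level=os.getenv("LOG_LEVEL", "INFO"))
logger.debug("visible only when LOG_LEVEL=DEBUG")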

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests (uv run pytest)
  5. Format code (uv run black . && uv run isort .)
  6. Commit changes (git commit -m 'Add amazing feature')
  7. Push to branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Roadmap

  • Add support for more document formats (PDF, DOCX)
  • Implement caching for extraction results
  • Add graph visualization capabilities
  • Support for custom extraction templates
  • Real-time document processing
  • Web interface for easier interaction
  • Advanced query capabilities
  • Export to other graph databases (Neo4j, etc.)

Support

For questions, issues, or contributions, please open an issue or pull request on the GitHub repository.


Made with ❤️ for the graph database and NLP community
