This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
MedMiner leverages large language models (LLMs) and LangGraph to extract and analyze data from medical documents. The project uses a workflow-based architecture to process doctor's letters and extract structured medical information (e.g., medications, diagnoses).
This project uses a dev container with all necessary tools pre-configured. Use VS Code's "Reopen in Container" feature to get started.
Package Manager: uv (modern Python package manager)
# Run all tests with coverage
uv run pytest
# Run a single test file
uv run pytest tests/test_file.py
# Run a specific test function
uv run pytest tests/test_file.py::test_function_name
# Format and lint code
uv run ruff check --fix
# Type checking
uv run ty check
# Run pre-commit hooks manually
pre-commit run --all-files
# Build documentation
mkdocs build
# Serve documentation locally
mkdocs serve

MedMiner uses a LangGraph-based workflow architecture where each extraction task is implemented as a compiled state graph with sequential node processing:
- Base Workflow (src/medminer/workflows/base/workflow.py)
  - BaseWorkflow: Abstract base for all workflows
  - BaseExtractionWorkflow: Template for extraction tasks, builds a graph with: Extraction → Processing Node(s) → Storage
- State Management (src/medminer/workflows/base/schema.py)
  - DoctorsLetterState: Base state with patient_id and letter text
  - ExtractuionState: Generic extraction state tracking raw/processed data and output path
  - States are TypedDict subclasses passed through the graph
- Node Types (src/medminer/workflows/base/node.py)
  - InformationExtractor: Uses LLM with structured output to extract data from text
  - BaseNode subclasses: Custom processing nodes (e.g., RxNavLookup for medication enrichment)
  - DataStorage: Final node that writes processed data to CSV
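
The Extraction → Processing → Storage sequence can be pictured with a small plain-Python sketch. It deliberately does not use LangGraph or any MedMiner class: `LetterState`, `extract`, `normalize`, `store` and `run_pipeline` are hypothetical names, the "extraction" is a line split standing in for the LLM call, and the CSV is written to an in-memory buffer instead of a file.

```python
import csv
import io
from collections.abc import Callable
from typing import TypedDict


class LetterState(TypedDict, total=False):
    """Hypothetical state, loosely mirroring the fields described above."""

    patient_id: str
    letter: str
    raw: list[dict[str, str]]
    processed: list[dict[str, str]]
    csv: str


Node = Callable[[LetterState], LetterState]


def extract(state: LetterState) -> LetterState:
    # Stand-in for the LLM extraction: every non-empty line becomes one item.
    new = state.copy()
    new["raw"] = [{"name": line.strip()} for line in state["letter"].splitlines() if line.strip()]
    return new


def normalize(state: LetterState) -> LetterState:
    # Stand-in for a processing node: lower-case the extracted names.
    new = state.copy()
    new["processed"] = [{"name": item["name"].lower()} for item in state["raw"]]
    return new


def store(state: LetterState) -> LetterState:
    # Stand-in for the storage node: render the processed rows as CSV text.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["patient_id", "name"])
    writer.writeheader()
    for item in state["processed"]:
        writer.writerow({"patient_id": state["patient_id"], **item})
    new = state.copy()
    new["csv"] = buffer.getvalue()
    return new


def run_pipeline(nodes: list[Node], state: LetterState) -> LetterState:
    # Each node receives the state returned by the previous one.
    for node in nodes:
        state = node(state)
    return state


result = run_pipeline(
    [extract, normalize, store],
    {"patient_id": "p1", "letter": "Aspirin\nIbuprofen\n"},
)
```

In the actual project the same ordering is expressed as edges of a compiled state graph rather than a Python loop; the sketch only shows how a TypedDict state accumulates raw and processed data on its way to storage.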
To add a new extraction workflow (e.g., for diagnoses, procedures):
- Create a new directory under src/medminer/workflows/ (e.g., diagnoses/)
- Define schemas in schema.py:
  - Raw extracted data TypedDict
  - Processed data TypedDict (if different from raw)
  - State class extending ExtractuionState
  - Response format extending ResponseFormat
- Implement processing nodes in node.py by subclassing BaseNode
- Create workflow in workflow.py:
  - Subclass BaseExtractionWorkflow
  - Set task_name, state_type, prompt, response_format
  - Add custom process_nodes if needed (default is NoProcessing)
Example: See src/medminer/workflows/medications/ for a complete implementation that extracts medications and enriches them with RxNorm/ATC codes.
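
As a rough illustration of the steps above, a diagnoses workflow might start out like the sketch below. The three base classes are local stand-ins so the snippet runs on its own; in the repository they would be imported from src/medminer/workflows/base/, and their real definitions may differ (for example, whether ResponseFormat is a TypedDict, or whether task_name and friends are class attributes, is an assumption here). The `Diagnosis` fields, including `icd10`, and the prompt text are hypothetical.

```python
from typing import TypedDict


# --- Stand-ins for the base package (assumptions, not the real definitions) ---
class ExtractuionState(TypedDict, total=False):
    patient_id: str
    letter: str
    output_path: str


class ResponseFormat(TypedDict):
    """Placeholder for the structured-output base type."""


class BaseExtractionWorkflow:
    """Placeholder for the template workflow class."""


# --- schema.py of the hypothetical diagnoses/ package ---
class Diagnosis(TypedDict):
    name: str
    icd10: str  # hypothetical field


class DiagnosisResponse(ResponseFormat):
    diagnoses: list[Diagnosis]


class DiagnosisState(ExtractuionState, total=False):
    raw_diagnoses: list[Diagnosis]
    processed_diagnoses: list[Diagnosis]


# --- workflow.py of the hypothetical diagnoses/ package ---
class DiagnosisWorkflow(BaseExtractionWorkflow):
    task_name = "diagnoses"
    state_type = DiagnosisState
    prompt = "Extract all diagnoses mentioned in the letter."
    response_format = DiagnosisResponse


# A state value is an ordinary dict that matches the TypedDict shape.
state: DiagnosisState = {
    "patient_id": "p1",
    "letter": "Dx: hypertension",
    "raw_diagnoses": [{"name": "Hypertension", "icd10": "I10"}],
}
```

The medications package remains the authoritative reference for how these pieces are actually wired together.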
src/medminer/settings.py provides a singleton Settings class (cached) for runtime configuration:
- register(key, value): Set configuration
- get(key, default): Retrieve configuration
- Used for base_dir, split_patient flag, etc.
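
One common way to get a cached singleton with this register/get interface is a factory wrapped in `functools.cache`. The sketch below is an assumption about the pattern, not a copy of settings.py: the `get_settings` factory and the `_values` attribute are hypothetical, and the real class may cache itself differently.

```python
from functools import cache
from typing import Any


class Settings:
    """Minimal key/value store with the register/get interface described above."""

    def __init__(self) -> None:
        self._values: dict[str, Any] = {}

    def register(self, key: str, value: Any) -> None:
        # Set or overwrite a configuration value.
        self._values[key] = value

    def get(self, key: str, default: Any = None) -> Any:
        # Return the stored value, or the default if the key was never registered.
        return self._values.get(key, default)


@cache
def get_settings() -> Settings:
    # Hypothetical factory: cache() makes every call return the same instance.
    return Settings()


get_settings().register("base_dir", "/data/out")
base_dir = get_settings().get("base_dir")
split_patient = get_settings().get("split_patient", False)
```

Because the instance is shared, a value registered once at startup is visible to every node that later asks for it.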
src/medminer/utils/models.py handles LLM provider configuration via environment variables:
- Reads {PROVIDER}_* env vars (e.g., OPENAI_API_KEY, OPENAI_MODEL)
- Currently supports OpenAI provider
- Returns model parameters as dict
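
The prefix-based lookup can be sketched as follows. The function name `model_params`, the optional `env` argument, and the choice to lower-case the suffix into the dict key are all assumptions made for illustration; the helper in utils/models.py may name and map things differently.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def model_params(provider: str, env: Mapping[str, str] | None = None) -> dict[str, str]:
    """Collect every {PROVIDER}_* variable into a dict keyed by the lower-cased suffix."""
    source = os.environ if env is None else env
    prefix = f"{provider.upper()}_"
    return {
        key[len(prefix):].lower(): value
        for key, value in source.items()
        if key.startswith(prefix)
    }


# Passing an explicit mapping keeps the example independent of the real environment.
params = model_params(
    "openai",
    {"OPENAI_API_KEY": "sk-test", "OPENAI_MODEL": "some-model", "PATH": "/usr/bin"},
)
```

Here `params` contains only the `api_key` and `model` entries; unrelated variables such as `PATH` are ignored.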
- Line length: 120 characters
- Python version: 3.13+
- Formatting: Ruff (similar to Black)
- Type hints: Required for all function definitions (mypy strict mode)
- String quotes: Double quotes
- Test files go in tests/ directory
- Use pytest with coverage enabled
- Pytest cache and pytest itself configured in pyproject.toml
- Coverage reports generated as XML to coverage.xml
Key dependencies:
- LangChain/LangGraph: LLM orchestration and graph-based workflows
- Pandas: Data processing and CSV output
- Gradio: UI components (in src/medminer/ui/)
- httpx: HTTP client for external APIs (e.g., RxNav)
See pyproject.toml for complete dependency list.