
feat: Add RAGAS evaluation framework for RAG quality assessment #2297

Merged
danielaskdd merged 20 commits into HKUDS:main from anouar-bm:feat/ragas-evaluation on Nov 4, 2025

Conversation

@anouar-bm
Contributor

Description

This PR adds a comprehensive evaluation system using the RAGAS framework to assess LightRAG's retrieval and generation quality. The evaluation framework provides automated testing and quality metrics for RAG systems, enabling developers to measure and improve their RAG implementations.

Related Issues

Addresses the need for systematic RAG quality assessment and provides tools for continuous monitoring of LightRAG's performance.

Changes Made

New Files Added

  • lightrag/evaluation/eval_rag_quality.py (394 lines)

    • RAGEvaluator class with async evaluation support
    • Four key RAG quality metrics:
      • Faithfulness (0.0-1.0): Measures answer factual accuracy vs retrieved context
      • Answer Relevance (0.0-1.0): Evaluates query-response alignment
      • Context Recall (0.0-1.0): Assesses retrieval completeness
      • Context Precision (0.0-1.0): Measures retrieved context quality/noise ratio
    • HTTP API integration for testing live LightRAG instances
    • JSON and CSV report generation with timestamps
    • Configurable test datasets via JSON files
    • Async evaluation with progress tracking
  • lightrag/evaluation/README.md (309 lines)

    • Complete installation guide
    • Usage examples with code samples
    • Detailed metric interpretation
    • Troubleshooting section
    • Best practices for RAG evaluation
  • lightrag/evaluation/__init__.py (16 lines)

    • Package initialization with version info

Modified Files

  • pyproject.toml

    • Added optional evaluation dependencies group:
      • ragas>=0.3.7 (RAG assessment framework)
      • datasets>=4.3.0 (HuggingFace datasets for RAGAS)
      • httpx>=0.28.1 (HTTP client for API testing)
      • pytest>=8.4.2 (testing framework)
      • pytest-asyncio>=1.2.0 (async test support)
  • .gitignore

    • Added lightrag/evaluation/results/ to exclude generated evaluation reports

Installation

# Install with evaluation dependencies
pip install lightrag-hku[evaluation]

Usage Example

from lightrag.evaluation.eval_rag_quality import RAGEvaluator

# Initialize evaluator
evaluator = RAGEvaluator(
    api_base_url="http://localhost:8000",
    test_dataset_path="lightrag/evaluation/test_dataset.json",
    output_dir="lightrag/evaluation/results"
)

# Run evaluation (call from within an async function or a running event loop)
results = await evaluator.evaluate_all()

# View results
print(f"Average Faithfulness: {results['avg_faithfulness']:.2f}")
print(f"Average Answer Relevance: {results['avg_answer_relevance']:.2f}")
print(f"Average Context Recall: {results['avg_context_recall']:.2f}")
print(f"Average Context Precision: {results['avg_context_precision']:.2f}")

Checklist

  • Changes tested locally with sample datasets
  • Code reviewed for quality and best practices
  • Documentation updated (comprehensive README included)
  • Unit tests added (evaluation framework with test dataset)
  • Dependencies properly declared as optional

Additional Notes

Design Principles

This evaluation framework is designed to be:

  • Generic: Works with any LightRAG instance via HTTP API
  • Non-intrusive: Zero changes to core LightRAG code
  • Optional: Installed separately via pip install lightrag-hku[evaluation]
  • Extensible: Easy to add custom metrics or test datasets
  • Production-ready: Async evaluation, error handling, progress tracking

Why RAGAS?

RAGAS is a widely-adopted framework in the RAG community that provides:

  • Standardized metrics for comparing RAG systems
  • Research-backed evaluation methodology
  • Active maintenance and community support
  • Integration with popular ML frameworks

Use Cases

  • Development: Measure improvements during RAG system development
  • CI/CD: Automated quality checks in deployment pipelines
  • A/B Testing: Compare different RAG configurations
  • Monitoring: Track RAG quality over time in production

Performance

  • Async evaluation for concurrent test execution
  • Caching support to avoid re-processing
  • Configurable batch sizes
  • Progress tracking for long-running evaluations

Thank you for reviewing this contribution!

This contribution adds a comprehensive evaluation system using the RAGAS
framework to assess LightRAG's retrieval and generation quality.

Features:
- RAGEvaluator class with four key metrics:
  * Faithfulness: Answer accuracy vs context
  * Answer Relevance: Query-response alignment
  * Context Recall: Retrieval completeness
  * Context Precision: Retrieved context quality
- HTTP API integration for live system testing
- JSON and CSV report generation
- Configurable test datasets
- Complete documentation with examples
- Sample test dataset included

Changes:
- Added lightrag/evaluation/eval_rag_quality.py (RAGAS evaluator implementation)
- Added lightrag/evaluation/README.md (comprehensive documentation)
- Added lightrag/evaluation/__init__.py (package initialization)
- Updated pyproject.toml with optional 'evaluation' dependencies
- Updated .gitignore to exclude evaluation results directory

Installation:
pip install lightrag-hku[evaluation]

Dependencies:
- ragas>=0.3.7
- datasets>=4.3.0
- httpx>=0.28.1
- pytest>=8.4.2
- pytest-asyncio>=1.2.0
Test cases with generic examples about:
- LightRAG framework features and capabilities
- RAG system architecture and components
- Vector database support (ChromaDB, Neo4j, Milvus, etc.)
- LLM provider integrations (OpenAI, Anthropic, Ollama, etc.)
- RAG evaluation metrics explanation
- Deployment options (Docker, FastAPI, direct integration)
- Knowledge graph-based retrieval concepts

Changes:
- Added generic test_dataset.json with 8 LightRAG-focused test cases
- File added with git add -f to override test_* pattern

This provides realistic, reusable examples for users testing their
LightRAG deployments and helps demonstrate the evaluation framework.
@anouar-bm
Contributor Author

Update: Added Generic Test Dataset

I've updated the PR to include a generic **test_dataset.json** file with LightRAG-focused examples.

What's Included

The test dataset now contains 8 generic test cases covering:

  1. LightRAG Overview - What is LightRAG and what problem it solves
  2. RAG Architecture - Main components of a RAG system
  3. LightRAG Features - Improvements over traditional RAG approaches
  4. Vector Database Support - Supported storage backends (ChromaDB, Neo4j, Milvus, Qdrant, MongoDB, Redis)
  5. Evaluation Metrics - Key RAG quality metrics (Faithfulness, Answer Relevance, Context Recall/Precision)
  6. Deployment Options - Docker, FastAPI server, direct Python integration
  7. LLM Providers - Supported providers (OpenAI, Anthropic Claude, Ollama, Azure, AWS Bedrock)
  8. Knowledge Graph Retrieval - Purpose of graph-based RAG systems

Why This Matters

Immediately useful - Users can test the evaluation framework on their LightRAG deployments without creating custom datasets
Educational - Examples demonstrate LightRAG capabilities and best practices
Realistic - Tests cover actual LightRAG features and use cases

The test dataset serves as both documentation and a working example for the evaluation framework.

@danielaskdd
Collaborator

The articles and corresponding test questions should be aligned, but it is unclear how the evaluation pipeline ensures this pairing. Additionally, there is no evidence of code within the evaluation program that uploads articles to LightRAG.

**Lint Fixes (ruff)**:
- Sort imports alphabetically (I001)
- Add blank line after import traceback (E302)
- Add trailing comma to dict literals (COM812)
- Reformat writer.writerow for readability (E501)

**Rename test_dataset.json → sample_dataset.json**:
- Avoids .gitignore pattern conflict (test_* is ignored)
- More descriptive name - it's a sample/template, not actual test data
- Updated all references in eval_rag_quality.py and README.md

Resolves lint-and-format CI check failure.
Addresses reviewer feedback about test dataset naming.
@anouar-bm
Contributor Author

@danielaskdd Thank you for the feedback! I'd like to clarify the evaluation workflow.

📋 How It Works

The evaluation framework assumes documents are already indexed in LightRAG. It does NOT upload documents - it only tests the existing system.

Workflow:

  1. User Setup (manual): Upload documents to LightRAG via WebUI/API/filesystem
  2. Create Test Dataset: User creates sample_dataset.json with questions matching THEIR documents
  3. Run Evaluation: Framework queries the existing LightRAG instance via /query API (read-only)
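
To illustrate step 3, here is a rough sketch of the kind of read-only request each test case issues against a running server. The /query endpoint, default port, and the include_references flag come from this PR (include_chunk_content is added in a later commit of this PR); the remaining payload details are assumptions, not the evaluator's exact code:

```python
# Rough sketch: query an existing LightRAG instance via /query and return
# the generated answer together with the retrieved references.
import httpx

async def query_lightrag(question: str, base_url: str = "http://localhost:9621") -> dict:
    payload = {
        "query": question,
        "include_references": True,      # ask for the chunks used to answer
        "include_chunk_content": True,   # enrich references with chunk text (added later in this PR)
    }
    async with httpx.AsyncClient(timeout=300.0) as client:
        resp = await client.post(f"{base_url}/query", json=payload)
        resp.raise_for_status()
        return resp.json()

# run with: asyncio.run(query_lightrag("What is LightRAG?"))
```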

🎯 Why No Upload Code?

Design Decision: Evaluation should test the existing RAG system, not modify it.

Benefits:

  • Separation of concerns (evaluation ≠ data management)
  • Real-world testing (test YOUR actual knowledge base)
  • Production safety (read-only access via /query)
  • Flexibility (users can populate via any method)

📦 About sample_dataset.json

Renamed from test_dataset.json (latest commit) to avoid .gitignore conflicts.

Purpose: Template/example showing JSON format
Content: Generic questions about LightRAG framework itself
NOT intended: To be used as-is - users must create their own test cases matching their documents

🔄 Latest Changes

  • ✅ Renamed test_dataset.json → sample_dataset.json
  • ✅ Applied ruff formatting (lint checks pass)
  • ✅ Updated README with Prerequisites section

Key insight: The evaluator queries an EXISTING system. Users upload docs first, then create aligned test cases.

Let me know if you'd like more clarification!

@danielaskdd
Collaborator

@codex review

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


anouar-bm and others added 7 commits November 2, 2025 16:16
**Critical Fix: Contexts vs Ground Truth**
- RAGAS metrics now evaluate actual retrieval performance
- Previously: Used ground_truth as contexts (always perfect scores)
- Now: Uses retrieved documents from LightRAG API (real evaluation)

**Changes to generate_rag_response (lines 100-156)**:
- Remove unused 'context' parameter
- Change return type: Dict[str, str] → Dict[str, Any]
- Extract contexts as list of strings from references[].text
- Return 'contexts' key instead of 'context' (JSON dump)
- Add response.raise_for_status() for better error handling
- Add httpx.HTTPStatusError exception handler

**Changes to evaluate_responses (lines 180-191)**:
- Line 183: Extract retrieved_contexts from rag_response
- Line 190: Use [retrieved_contexts] instead of [[ground_truth]]
- Now correctly evaluates: retrieval quality, not ground_truth quality

**Impact on RAGAS Metrics**:
- Context Precision: Now ranks actual retrieved docs by relevance
- Context Recall: Compares ground_truth against actual retrieval
- Faithfulness: Verifies answer based on actual retrieved contexts
- Answer Relevance: Unchanged (question-answer relevance)

Fixes incorrect evaluation methodology. Based on RAGAS documentation:
- contexts = retrieved documents from RAG system
- ground_truth = reference answer for context_recall metric

References:
- https://docs.ragas.io/en/stable/concepts/components/eval_dataset/
- https://docs.ragas.io/en/stable/concepts/metrics/
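
For context, a minimal sketch of how a dataset built this way is typically scored with RAGAS. The imports mirror the ones quoted later in this review; the question/answer/contexts/ground_truth column names follow the older RAGAS schema referenced in this commit, and newer releases may expect different field names:

```python
# Minimal sketch of scoring one test case with RAGAS; illustrative only,
# not the PR's exact evaluation code.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

dataset = Dataset.from_dict({
    "question": ["What storage backends does LightRAG support?"],
    "answer": ["LightRAG supports ChromaDB, Neo4j, Milvus and more."],            # RAG answer
    "contexts": [["LightRAG integrates with ChromaDB, Neo4j, Milvus, Qdrant."]],  # retrieved chunks
    "ground_truth": ["LightRAG supports ChromaDB, Neo4j, Milvus, Qdrant, MongoDB and Redis."],
})

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)
print(result)  # per-metric scores in the 0.0-1.0 range
```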
Optimize RAGAS evaluation with parallel execution and chunk content enrichment

Added efficient RAG evaluation system with optimized API calls and comprehensive benchmarking.

Key Features:
- Single API call per evaluation (2x faster than before)
- Parallel evaluation based on MAX_ASYNC environment variable
- Chunk content enrichment in /query endpoint responses
- Comprehensive benchmark statistics (averages)
- NaN-safe metric calculations

API Changes:
- Added include_chunk_content parameter to QueryRequest (backward compatible)
- /query endpoint enriches references with actual chunk content when requested
- No breaking changes - default behavior unchanged

Evaluation Improvements:
- Parallel execution using asyncio.Semaphore (respects MAX_ASYNC)
- Shared HTTP client with connection pooling
- Proper timeout handling (3min connect, 5min read)
- Debug output for context retrieval verification
- Benchmark statistics with averages, min/max scores

Results:
- Average RAGAS Score: 0.9772
- Perfect Faithfulness: 1.0000
- Perfect Context Recall: 1.0000
- Perfect Context Precision: 1.0000
- Excellent Answer Relevance: 0.9087
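
As a side note on the parallel execution mentioned above, here is a minimal sketch of the Semaphore-bounded pattern. Reading MAX_ASYNC from the environment follows this commit's description; everything else (function names, sleep placeholder) is illustrative:

```python
# Illustrative sketch: bound concurrent test-case evaluations with an
# asyncio.Semaphore sized from the MAX_ASYNC environment variable.
import asyncio
import os

MAX_ASYNC = int(os.getenv("MAX_ASYNC", "4"))

async def evaluate_case(case_id: int) -> str:
    await asyncio.sleep(0.1)  # placeholder for querying the API and running RAGAS
    return f"case {case_id} done"

async def evaluate_all(num_cases: int) -> list:
    sem = asyncio.Semaphore(MAX_ASYNC)

    async def bounded(case_id: int) -> str:
        async with sem:
            return await evaluate_case(case_id)

    return await asyncio.gather(*(bounded(i) for i in range(num_cases)))

print(asyncio.run(evaluate_all(8)))
```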
Added comprehensive documentation for the new include_chunk_content parameter
that enables retrieval of actual chunk text content in API responses.

Documentation Updates:
- Added "Include Chunk Content in References" section to API README
- Explained use cases: RAG evaluation, debugging, citations, transparency
- Provided JSON request/response examples
- Clarified parameter interaction with include_references

OpenAPI/Swagger Examples:
- Added "Response with chunk content" example to /query endpoint
- Shows complete reference structure with content field
- Demonstrates realistic chunk text content

This makes the feature discoverable through:
1. API documentation (README.md)
2. Interactive Swagger UI (http://localhost:9621/docs)
3. Code examples for developers
@anouar-bm
Contributor Author

Description

  • RAGAS evaluator now runs against the actual retrieved contexts, executes test cases concurrently, and routes diagnostics through lightrag.utils.logger.
  • /query can optionally enrich references with chunk content while avoiding quadratic string concatenation, and the API docs explain how to enable it.

    RAGAS evaluation currently expects LLM_BINDING=openai; other LLM providers haven’t been tested.

Changes Made

  • lightrag/evaluation/eval_rag_quality.py: concurrency, timeout constants, _is_nan guard, logger-based instrumentation, NaN-safe aggregation.
  • lightrag/api/routers/query_routes.py: collect repeated chunk content in lists and join once when enriching references.
  • lightrag/api/README.md: document the include_chunk_content flag with request/response examples.
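
On the query_routes.py change above, a small self-contained sketch of the single-join pattern it describes. The variable names and sample data are hypothetical; only the idea of collecting chunk texts in a list and joining once is taken from this change:

```python
# Sketch: accumulate chunk texts per file in a list and join once, instead of
# repeated string concatenation, which grows quadratically with chunk count.
retrieved_chunks = [
    {"file_path": "doc.md", "content": "chunk1"},
    {"file_path": "doc.md", "content": "chunk2"},
]

chunks_by_file = {}
for chunk in retrieved_chunks:
    chunks_by_file.setdefault(chunk["file_path"], []).append(chunk["content"])

references = [
    {"file_path": path, "content": "\n\n".join(texts)}  # one join per file
    for path, texts in chunks_by_file.items()
]
print(references)
```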

Checklist

  • Changes tested locally
  • Uses OpenAI binding for RAGAS (other bindings untested)
  • Code reviewed
  • Documentation updated (README + Swagger example)
  • Unit tests added (not applicable for this change)

Additional Notes

  • Ran the evaluator on the portfolio dataset; results reported by the script:
    Latest uv run lightrag/evaluation/eval_rag_quality.py log (OpenAI binding):
    INFO: ✅ LLM_BINDING: openai
    INFO: 
    INFO: ======================================================================
    INFO: 🔍 RAGAS Evaluation - Using Real LightRAG API
    INFO: ======================================================================
    INFO: 📡 RAG API URL: http://localhost:9621 (default)
    INFO: ======================================================================
    INFO: 
    INFO: ======================================================================
    INFO: 🚀 Starting RAGAS Evaluation of Portfolio RAG System
    INFO: 🔧 Parallel evaluations: 2
    INFO: ======================================================================
    INFO: [1/3] Evaluating: What were the standout results from the Neural ODE dropout p...
    INFO: [2/3] Evaluating: Which capabilities define the multimodal 3D avatar agent bui...
    INFO: ✅ Faithfulness: 1.0000
    INFO: ✅ Answer Relevance: 0.9231
    INFO: ✅ Context Recall: 1.0000
    INFO: ✅ Context Precision: 1.0000
    INFO: 📊 RAGAS Score: 0.9808
    INFO: [3/3] Evaluating: How did the Launchpad internship improve the multi-tenant CR...
    INFO: ✅ Faithfulness: 1.0000
    INFO: ✅ Answer Relevance: 0.8664
    INFO: ✅ Context Recall: 1.0000
    INFO: ✅ Context Precision: 1.0000
    INFO: 📊 RAGAS Score: 0.9666
    INFO: ✅ Faithfulness: 1.0000
    INFO: ✅ Answer Relevance: 0.8940
    INFO: ✅ Context Recall: 1.0000
    INFO: ✅ Context Precision: 1.0000
    INFO: 📊 RAGAS Score: 0.9735
    INFO: ✅ JSON results saved to: lightrag/evaluation/results/results_20251102_193350.json
    INFO: ✅ CSV results saved to: lightrag/evaluation/results/results_20251102_193350.csv
    INFO: 
    INFO: ======================================================================
    INFO: 📊 EVALUATION COMPLETE
    INFO: ======================================================================
    INFO: Total Tests:    3
    INFO: Successful:     3
    INFO: Failed:         0
    INFO: Success Rate:   100.00%
    INFO: Elapsed Time:   77.90 seconds
    INFO: Avg Time/Test:  25.97 seconds
    INFO: 
    INFO: ======================================================================
    INFO: 📈 BENCHMARK RESULTS (Average)
    INFO: ======================================================================
    INFO: Average Faithfulness:      1.0000
    INFO: Average Answer Relevance:  0.8945
    INFO: Average Context Recall:    1.0000
    INFO: Average Context Precision: 1.0000
    INFO: Average RAGAS Score:       0.9736
    

@danielaskdd
Collaborator

To facilitate evaluation and user testing, please also submit the corresponding source text for sample_dataset.json to the repository.

@danielaskdd
Collaborator

Regarding the query interface's context return, please consider the following improvements:

  1. Since a single file may correspond to multiple chunks, the content field should be a list. Please update the documentation and examples accordingly to reflect this.
  2. The query_text_stream parameter should support the include_references option, just like query_text, to enable the client to receive reference information in the response.

BREAKING CHANGE: The `content` field in query response references is now
an array of strings instead of a concatenated string. This preserves
individual chunk boundaries when a single file has multiple chunks.

Changes:
- Update QueryResponse Pydantic model to accept List[str] for content
- Modify query_text endpoint to return content as list (query_routes.py:425)
- Modify query_text_stream endpoint to support chunk content enrichment
- Update OpenAPI schema and examples to reflect array structure
- Update API README with breaking change notice and migration guide
- Fix RAGAS evaluation to flatten chunk content lists
BREAKING CHANGE: content field is now List[str] instead of str

- Add ReferenceItem Pydantic model for type safety
- Update /query and /query/stream to return content as list
- Update OpenAPI schema and examples
- Add migration guide to API README
- Fix RAGAS evaluation to handle list format

Addresses PR HKUDS#2297 feedback. Tested with RAGAS: 97.37% score.
@anouar-bm
Contributor Author

anouar-bm commented Nov 3, 2025

✅ Addressed All Feedback

Thank you for the review! All requested changes have been implemented:

1. ✅ Content Field Now Returns a List

Before:

{"reference_id": "1", "file_path": "doc.md", "content": "chunk1\n\nchunk2"}

After:

{"reference_id": "1", "file_path": "doc.md", "content": ["chunk1", "chunk2"]}

Files Changed:

  • lightrag/api/routers/query_routes.py:149-156 - Added ReferenceItem Pydantic model for type safety
  • lightrag/api/routers/query_routes.py:433 - /query endpoint returns list
  • lightrag/api/routers/query_routes.py:697 - /query/stream endpoint returns list

2. ✅ query_text_stream Supports include_chunk_content

Added chunk content enrichment to /query/stream endpoint. Both streaming and non-streaming modes now support include_chunk_content.


3. ✅ Updated Documentation

  • OpenAPI schema updated to show content as array (query_routes.py:211-215)
  • All examples updated (query_routes.py:241-264, 468-472)
  • API README with breaking change notice (lightrag/api/README.md:477-517)
  • RAGAS evaluation compatibility fix (lightrag/evaluation/eval_rag_quality.py:183-193)

🧪 RAGAS Validation

Tested with live RAGAS evaluation:

Average RAGAS Score:       97.37%  ✅
Average Faithfulness:      100%    ✅
Average Context Recall:    100%    ✅
Average Context Precision: 100%    ✅

Perfect context metrics confirm the list structure preserves chunk boundaries correctly.


📋 Migration Guide

To get a single string:

content_str = "\n\n".join(reference["content"])
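
On the evaluation side, a small illustrative sketch of flattening the list-valued content fields into a flat list of context strings for RAGAS. The reference shape follows the example above; the flattening code itself is not the exact PR code:

```python
# Illustrative flattening of list-valued reference content into RAGAS contexts.
references = [
    {"reference_id": "1", "file_path": "doc.md", "content": ["chunk1", "chunk2"]},
    {"reference_id": "2", "file_path": "other.md", "content": ["chunk3"]},
]

retrieved_contexts = [chunk for ref in references for chunk in ref["content"]]
print(retrieved_contexts)  # ['chunk1', 'chunk2', 'chunk3']
```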

📝 Test Data Note

Note on sample_dataset.json:
The test queries in sample_dataset.json were originally developed using my personal portfolio data (including projects like Neural ODE internships, multimodal avatar development, and CRM systems). All source documents are Markdown files detailing my professional experience.

For reproducibility, the version pushed to the repository is generalized, so anyone can test it using their own documents. Tests passed successfully with my personal data.


@danielaskdd
Collaborator

Could you please remove the sensitive information and retain only the content related to the test questions, so we can compile a test document to facilitate users in reproducing the evaluation results?

@danielaskdd
Collaborator

@codex review

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 43 to 78
# Setup OpenAI API key (required for RAGAS evaluation)
# Use LLM_BINDING_API_KEY when running with the OpenAI binding

llm_binding = os.getenv("LLM_BINDING", "").lower()
llm_binding_key = os.getenv("LLM_BINDING_API_KEY")

# Validate LLM_BINDING is set to openai
if llm_binding != "openai":
    logger.error(
        "❌ LLM_BINDING must be set to 'openai'. Current value: '%s'",
        llm_binding or "(not set)",
    )
    sys.exit(1)

# Validate LLM_BINDING_API_KEY exists
if not llm_binding_key:
    logger.error("❌ LLM_BINDING_API_KEY is not set. Cannot run RAGAS evaluation.")
    sys.exit(1)

# Set OPENAI_API_KEY from LLM_BINDING_API_KEY
os.environ["OPENAI_API_KEY"] = llm_binding_key
logger.info("✅ LLM_BINDING: openai")

try:
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_relevancy,
        context_precision,
        context_recall,
        faithfulness,
    )
except ImportError as e:
    logger.error("❌ RAGAS import error: %s", e)
    logger.error(" Install with: pip install ragas datasets")
    sys.exit(1)


P1: Avoid exiting process at module import in evaluation script

The new RAGAS evaluator module performs environment validation and dependency checks at import time and calls sys.exit(1) when LLM_BINDING is not openai, when the API key is missing, or when ragas/datasets are not installed. Importing lightrag.evaluation.eval_rag_quality therefore terminates the entire Python process instead of raising an exception, making it impossible to programmatically import RAGEvaluator unless the environment is preconfigured exactly as expected. These checks should be moved into main()/run() and raise informative exceptions so that importing the module is side-effect free.


Comment on lines 1 to 16
"""
LightRAG Evaluation Module

RAGAS-based evaluation framework for assessing RAG system quality.

Usage:
from lightrag.evaluation.eval_rag_quality import RAGEvaluator

evaluator = RAGEvaluator()
results = await evaluator.run()

Note: RAGEvaluator is imported dynamically to avoid import errors
when ragas/datasets are not installed.
"""

__all__ = ["RAGEvaluator"]


P1: Provide RAGEvaluator in evaluation package namespace

The new lightrag.evaluation package sets __all__ = ["RAGEvaluator"] but never defines or imports RAGEvaluator. As a result from lightrag.evaluation import RAGEvaluator raises AttributeError, contradicting the module docstring and breaking expected usage. Re-export the class with a direct import or a lazy __getattr__ so that callers can import it from the package.


anouar-bm and others added 2 commits November 3, 2025 05:56
fix(evaluation): Move import-time validation to runtime and improve documentation

Changes:
- Move sys.exit() calls from module level to __init__() method
- Raise proper exceptions (ImportError, ValueError, EnvironmentError) instead of sys.exit()
- Add lazy import for RAGEvaluator in __init__.py using __getattr__
- Update README to clarify sample_dataset.json contains generic test data (not personal)
- Fix README to reflect actual output format (JSON + CSV, not HTML)
- Improve documentation for custom test case creation

Addresses code review feedback about import-time validation and module exports.
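
A minimal sketch of the two patterns this commit describes: a lazy re-export via module-level __getattr__ (PEP 562) and runtime validation that raises instead of exiting. The snippets are illustrative, not the merged code:

```python
# Sketch of lightrag/evaluation/__init__.py: importing the package stays
# side-effect free; RAGEvaluator is only imported when actually requested.
__all__ = ["RAGEvaluator"]

def __getattr__(name):
    if name == "RAGEvaluator":
        from lightrag.evaluation.eval_rag_quality import RAGEvaluator
        return RAGEvaluator
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```

```python
# Sketch of the constructor-time checks in eval_rag_quality.py: raise
# informative exceptions instead of calling sys.exit() at import time.
import os

class RAGEvaluator:
    def __init__(self) -> None:
        if os.getenv("LLM_BINDING", "").lower() != "openai":
            raise ValueError("LLM_BINDING must be set to 'openai' for RAGAS evaluation")
        if not os.getenv("LLM_BINDING_API_KEY"):
            raise EnvironmentError("LLM_BINDING_API_KEY is not set")
        try:
            import ragas  # noqa: F401
        except ImportError as exc:
            raise ImportError(
                "RAGAS is not installed; install with: pip install lightrag-hku[evaluation]"
            ) from exc
```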
@anouar-bm
Contributor Author

Updates

✅ Changed content field from string to List[str] with ReferenceItem Pydantic model
✅ Added include_chunk_content support to /query/stream endpoint
✅ Fixed evaluation module: moved sys.exit() to runtime, added lazy imports, updated docs

Note: Test data in sample_dataset.json is generic (not personal) - users should customize based on their indexed documents.

@danielaskdd
Collaborator

The user requires a test file that works in conjunction with sample_dataset.json to generate valid and accurate scores.

@anouar-bm
Contributor Author

@danielaskdd The evaluation results I shared (97.37% RAGAS score) used my personal portfolio documents with customized entity extraction prompts for better results. Neither the documents nor custom prompts are in this repo.

The current sample_dataset.json contains generic LightRAG questions as a template. I can add matching sample markdown files (covering LightRAG features, architecture, vector DBs, deployment) so users can reproduce evaluation results with the default prompts. Should I create these?

Add 5 markdown documents that users can index to reproduce evaluation results.

Changes:
- Add sample_documents/ folder with 5 markdown files covering LightRAG features
- Update sample_dataset.json with 3 improved, specific test questions
- Shorten and correct evaluation README (removed outdated info about mock responses)
- Add sample_documents reference with expected ~95% RAGAS score

Test Results with sample documents:
- Average RAGAS Score: 95.28%
- Faithfulness: 100%, Answer Relevance: 96.67%
- Context Recall: 88.89%, Context Precision: 95.56%
@anouar-bm
Contributor Author

✅ Added Sample Documents for Reproducible Testing

I've added 5 markdown documents that users can index to reproduce evaluation results.

Changes:

  • Added sample_documents/ folder with 5 markdown files covering LightRAG features
  • Updated sample_dataset.json with 3 improved, specific test questions
  • Shortened and corrected evaluation README (removed outdated info about mock responses)
  • Added sample_documents reference with expected ~95% RAGAS score

Test Results with sample documents:

  • Average RAGAS Score: 95.28%
  • Faithfulness: 100%, Answer Relevance: 96.67%
  • Context Recall: 88.89%, Context Precision: 95.56%

Users can now index these docs and reproduce the benchmark results.

@danielaskdd danielaskdd merged commit ad2d3c2 into HKUDS:main Nov 4, 2025
1 check passed
@danielaskdd
Collaborator

Enhanced RAGAS Evaluation Framework

The updates significantly enhance compatibility with any OpenAI-compatible service (including custom endpoints via bypass_n mode) and introduce intelligent concurrency control to prevent rate limiting. The update also adds robust NaN handling, comprehensive configuration options, and professionally formatted output.

Before:

  • Compatibility issue with RAGAS 0.3.x, specifically concerning the answer_relevancy metric's default strictness parameter.
  • No concurrency control (frequent rate limiting)
  • A race condition could set the injected LLM to None when two evaluations overlapped
  • Basic RAGAS integration with hard-coded OpenAI dependency
  • Insufficient error handling for NaN values during average calculation
  • Minimal configuration options
  • Does not support LightRAG Server when API key authentication is required.

After:

  • Full compatibility with custom OpenAI-compatible endpoints via bypass_n mode
  • Intelligent concurrency control to prevent rate limiting
  • RAGAS run now builds fresh metric objects (Faithfulness(), AnswerRelevancy(), ContextRecall(), ContextPrecision()).
  • Full support for any OpenAI-compatible service (LM Studio, vLLM, etc.)
  • Flexible model configuration with fallbacks
  • Robust NaN handling across all metric calculations
  • Comprehensive environment-based configuration
  • Supports the LIGHTRAG_API_KEY auth header
  • Professionally formatted output
  • Environment variables added (env.example):
    • EVAL_LLM_MODEL (default: gpt-4o-mini): LLM for RAGAS evaluation
    • EVAL_EMBEDDING_MODEL (default: text-embedding-3-small): Embedding model
    • EVAL_LLM_BINDING_API_KEY (falls back to OPENAI_API_KEY): API key
    • EVAL_LLM_BINDING_HOST (optional): Custom endpoint URL
    • EVAL_MAX_CONCURRENT (default: 1): Concurrent evaluations
    • EVAL_QUERY_TOP_K (default: 10): Documents per query
    • EVAL_LLM_MAX_RETRIES (default: 5): Request retries
    • EVAL_LLM_TIMEOUT (default: 120): Request timeout in seconds
  • Renamed "context" field to "project" in Sample Dataset (sample_dataset.json)

🔧 Technical Details

  1. RAGAS Compatibility: RAGAS 0.3.x requires properly wrapped LLM instances
  2. Custom Endpoint Support: Many OpenAI-compatible services don't support all OpenAI API parameters
  3. The n Parameter Problem:
    • OpenAI's API supports n parameter to generate multiple completions in one request
    • Custom endpoints (Ollama, LM Studio, vLLM, etc.) often reject this parameter
    • RAGAS metrics like answer_relevancy internally use n for generating multiple variants
  4. The Solution: bypass_n=True makes RAGAS generate multiple outputs through repeated single prompts instead
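
For intuition, a toy sketch of that workaround: instead of one request with n=3, issue three single-completion requests and collect the outputs. The completion function here is a generic placeholder, not the RAGAS or OpenAI client API:

```python
# Toy illustration of the bypass_n idea: emulate "n" completions by repeating
# single-completion calls, for endpoints that reject the n parameter.
from typing import Callable

def generate_n(completion_fn: Callable[[str], str], prompt: str, n: int = 3) -> list:
    return [completion_fn(prompt) for _ in range(n)]

# Example with a dummy completion function standing in for a real client call:
outputs = generate_n(lambda p: f"variant of: {p}", "Rephrase the user question", n=3)
print(outputs)
```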

@danielaskdd
Collaborator

Thank you for your contribution to LightRAG. It has been a pleasure collaborating with you.

@anouar-bm
Contributor Author

anouar-bm commented Nov 4, 2025

I will later push the following workflow that I use:

flowchart TD
    A[Query + Conversation History] --> B[Backend: Check if query needs RAG]
    
    B --> C{need_rag value?}
    
    C -->|true| D["JSON: {need_rag: true, query_enhanced: string}"]
    D --> E[Extract query_enhanced]
    E --> F[Process RAG: Retrieve relevant documents]
    F --> G[Generate response with RAG context]
    G --> H[Return RAG response]
    
    C -->|false| I["JSON: {need_rag: false, query_response: string}"]
    I --> J[Extract query_response]
    J --> K[Return query_response directly]
    
    H --> L[Final Response to User]
    K --> L[Final Response to User]
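
A hedged Python sketch of the routing the flowchart describes; the classifier and RAG callables are hypothetical placeholders, and only the JSON decision shape comes from the diagram:

```python
# Sketch of the need_rag routing step; classify_query and run_rag are
# hypothetical stand-ins for the LLM classifier and the RAG pipeline.
import json

def route_query(query: str, history: list, classify_query, run_rag) -> str:
    decision = json.loads(classify_query(query, history))
    if decision["need_rag"]:
        return run_rag(decision["query_enhanced"])  # retrieve documents, then generate
    return decision["query_response"]               # answer directly, no retrieval

# Example with dummy callables:
print(route_query(
    "hello!",
    [],
    classify_query=lambda q, h: json.dumps({"need_rag": False, "query_response": "Hi there!"}),
    run_rag=lambda q: "RAG answer for: " + q,
))
```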


@danielaskdd
Collaborator

LightRAG currently includes a bypass query mode that enables user queries to be directly forwarded to the LLM.

taddeusb90 added a commit to neuro-inc/LightRAG that referenced this pull request Nov 8, 2025
* docs: clarify docling exclusion in offline Docker image

* Bump core version to 1.4.9.4

* Add reminder note to manual Docker build workflow

* Change Docker build cache mode from max to min

• Reduce cache storage usage
• Try to fix GithHub Action failure

* Optimize Docker builds with layer caching and add pip for runtime installs

• Split deps and source code layers
• Add --no-editable flag to uv sync
• Install pip for runtime packages
• Improve build cache efficiency

* Update comments

* Remove torch and transformers from offline dependency groups

* Change default docker image to offline version

• Add lite verion docker image with tiktoken cache
• Update docs and build scripts

* docs: improve Docker build documentation with clearer notes

* Improve Docker build workflow with automated multi-arch script and docs

* Allow related chunks missing in knowledge graph queries

* remove deprecated dotenv package.

* Fix cache control error of index.html

• Retrun no-cache for all HTML responses not just .html files
• Prevent force browser refresh action after front-end rebuild

* Simplify Docker deployment documentation and improve clarity

* Remove dotenv dependency from project

* Improve API description formatting and add ReDoc link

* Quick fix to limit source_id ballooning while inserting nodes

* Import from env and use default if none and removed useless import

* Get max source Id config from .env and lightRAG init

* Fix tuple delimiter corruption handling in regex patterns

* Improve AsyncSelect layout and text overflow handling

- Add responsive width container
- Improve text truncation with tooltips

* Fix redoc access problem in front-end dev mode

- Add /redoc endpoint to proxy config
- Remove root path from API endpoints
- Add .env.development to git reopo
- Update sample environment files
- Refine .gitignore patterns for env files

* Optimize chat performance by reducing animations in inactive tabs

• Add isTabActive prop to ChatMessage
• Disable spinner in inactive tabs
• Reduce opacity for inactive content
• Hide loading indicator when inactive
• Pass tab state from RetrievalTesting

* Bump API version to 0241

* Refactor SQL queries and improve input handling in PGKVStorage and PGDocStatusStorage

* Update Swagger API key status description text

* Fix linting

* Add entity/relation chunk tracking with configurable source ID limits

- Add entity_chunks & relation_chunks storage
- Implement KEEP/FIFO limit strategies
- Update env.example with new settings
- Add migration for chunk tracking data
- Support all KV storage

* Add file path limit configuration for entities and relations

• Add MAX_FILE_PATHS env variable
• Implement file path count limiting
• Support KEEP/FIFO strategies
• Add truncation placeholder
• Remove old build_file_path function

* Fix logging message formatting

* Optimize PostgreSQL initialization performance

- Batch index existence checks into single query (16+ queries -> 1 query)
- Batch timestamp column checks into single query (8 queries -> 1 query)
- Batch field length checks into single query (5 queries -> 1 query)

Performance improvement: ~70-80% faster initialization (35s -> 5-10s)

Key optimizations:
1. check_tables(): Use ANY($1) to check all indexes at once
2. _migrate_timestamp_columns(): Batch all column type checks
3. _migrate_field_lengths(): Batch all field definition checks

All changes are backward compatible with no schema or API changes.
Reduces database round-trips by batching information_schema queries.

* Add truncation indicator and update property labels in graph view

• Add truncate tooltip to source_id field
• Add visual truncation indicator (†)
• Bump API version to 0242

* Track placeholders in file paths for accurate source count display

• Add has_placeholder tracking variable
• Detect placeholder patterns in paths
• Show + sign for truncated counts

* Update openai requirement from <2.0.0,>=1.0.0 to >=1.0.0,<3.0.0

Updates the requirements on [openai](https://github.com/openai/openai-python) to permit the latest version.
- [Release notes](https://github.com/openai/openai-python/releases)
- [Changelog](https://github.com/openai/openai-python/blob/main/CHANGELOG.md)
- [Commits](openai/openai-python@v1.0.0...v2.6.0)

---
updated-dependencies:
- dependency-name: openai
  dependency-version: 2.6.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Refactor entity/relation merge to consolidate VDB operations within functions

• Move VDB upserts into merge functions
• Fix early return data structure issues
• Update status messages (IGNORE_NEW → KEEP)
• Consolidate error handling paths
• Improve relationship content format

* Refactor deduplication calculation and remove unused variables

* Update truncation message format in properties tooltip

* Standardize placeholder format to use colon separator consistently

* Increase default limits for source IDs and file paths in metadata

• Entity source IDs: 3 → 300
• Relation source IDs: 3 → 300
• File paths: 2 → 30

* Simplify skip logging and reduce pipeline status updates

* Refactor node and edge merging logic with improved code structure

• Add numbered steps for clarity
• Improve early return handling
• Enhance file path limiting logic

* Improve file path truncation labels and UI consistency

• Standardize FIFO/KEEP truncation labels
• Update UI truncation text format

* Change default source IDs limit method from KEEP to FIFO

* Improve logging to show source ID ratios when skipping entities/edges

* Fix Redis data migration error

• Use proper Redis connection context
• Fix namespace pattern for key scanning
• Propagate storage check exceptions
• Remove defensive error swallowing

* Increase default max file paths from 30 to 100 and improve documentation

- Bump DEFAULT_MAX_FILE_PATHS to 100
- Add clarifying comment about display

* Improve formatting of limit method info in rebuild functions

* Preserve file path order by using lists instead of sets

* Update pandas requirement from <2.3.0,>=2.0.0 to >=2.0.0,<2.4.0

Updates the requirements on [pandas](https://github.com/pandas-dev/pandas) to permit the latest version.
- [Release notes](https://github.com/pandas-dev/pandas/releases)
- [Commits](pandas-dev/pandas@v2.0.0...v2.3.3)

---
updated-dependencies:
- dependency-name: pandas
  dependency-version: 2.3.3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* fix(docs): correct typo "acivate" → "activate"

* Fix typo in 'equipment' in prompt.py

* Add optional LLM cache deletion when deleting documents

• Add delete_llm_cache parameter to API
• Collect cache IDs from text chunks
• Delete cache after graph operations
• Update UI with new checkbox option
• Add i18n translations for cache option

* Bump API version to 0243

* Handle cache deletion errors gracefully instead of raising exceptions

* Fix linting

* Add entity name length truncation with configurable limit

* Fix linting

* Improve entity identifier truncation warning message format

* Fix RayAnything compatible problem

• Use "preprocessed" to indicate multimodal processing is required
• Update DocProcessingStatus to process status convertion automatically
• Remove multimodal_processed from DocStatus enum value
• Update UI filter logic

* Fix dimension type comparison in Milvus vector field validation

• Convert dimensions to int for comparison
• Handle string vs int type mismatches

* Bump API version to 0244

* Update Docker deployment comments for LLM and embedding hosts

* Allow users to provide keywords with QueryRequest

* Add pipeline cancellation feature for graceful processing termination

• Add cancel_pipeline API endpoint
• Implement PipelineCancelledException
• Add cancellation checks in main loop
• Handle task cancellation gracefully
• Mark cancelled docs as FAILED

* Add cancellation check in delete loop

* Add pipeline cancellation feature with UI and i18n support

- Add cancelPipeline API endpoint
- Add cancel button to status dialog
- Update status response type
- Add cancellation UI translations
- Handle cancellation request states

* Improve error handling and add cancellation checks in pipeline

* Resolve lock leakage issue during user cancellation handling

• Change default log level to INFO
• Force enable error logging output
• Add lock cleanup rollback protection
• Handle LLM cache persistence errors
• Fix async task exception handling

* Simplify pipeline status dialog by consolidating message sections

• Remove separate latest message section
• Combine into single pipeline messages area
• Add overflow-x-hidden for better display
• Change break-words to break-all
• Update translations across all locales

* Add confirmation dialog for pipeline cancellation

* Remove separate retry button and merge functionality into scan button

* Bump core version to 1.4.9.5 and API version to 0245

* Improve lock logging with consistent messaging and debug levels

* Rename rebuild function name and improve relationship logging format

* Remove enable_logging parameter from data init lock call

* Optimize PostgreSQL graph queries to avoid Cypher overhead and complexity

• Replace Cypher with native SQL queries
• Fix O(N²) to O(E) performance issue
• Add error handling for parse failures
• Use direct table access pattern
• Eliminate Cartesian product joins

* Fix entity consistency in knowledge graph rebuilding and merging

• Sort src/tgt for consistent ordering
• Create missing nodes before edges
• Update entity chunks storage
• Pass entity_vdb to rebuild function
• Ensure entities exist in all storages

* Fix entity and relation chunk cleanup in deletion pipeline

• Delete from entity_chunks storage
• Delete from relation_chunks storage

* refactor: Qdrant Multi-tenancy (Include staged)

Signed-off-by: Anush008 <[email protected]>

* Enhance entity/relation editing with chunk tracking synchronization

• Add chunk storage sync to edit ops
• Implement incremental chunk ID updates
• Support entity renaming migrations
• Normalize relation keys consistently
• Preserve chunk references on edits

* Normalize entity order for undirected graph consistency

• Normalize entity pairs for storage
• Update API docs for undirected edges

* Add chunk tracking cleanup to entity/relation deletion and creation

• Clean up chunk storage on delete
• Track chunks in create operations
• Normalize relation keys consistently

* Refactor graph utils to use unified persistence callback

- Add _persist_graph_updates function
- Remove duplicate callback functions

* Fix entity merging to include target entity relationships

* Include target entity in collection
* Merge all relevant relationships
* Prevent relationship loss
* Fix merge completeness

* Refactor entity merging with unified attribute merge function

• Update GRAPH_FIELD_SEP comment clarity
• Deprecate merge_strategy parameter
• Unify entity/relation merge logic
• Add join_unique_comma strategy

* Fix relation deduplication logic and standardize log message prefixes

* Add chunk tracking support to entity merge functionality

- Pass chunk storages to merge function
- Merge relation chunk tracking data
- Merge entity chunk tracking data
- Delete old chunk tracking records
- Persist chunk storage updates

* Replace global graph DB lock with fine-grained keyed locking

• Use entity/relation-specific locks
• Lock multiple entities when needed

* Enable editing of entity_type field in node properties

* Bump API version to 0246

* Fix vector deletion logging to show actual deleted count

* Refactor entity edit and merge functions to support merge-on-rename

• Extract internal implementation helpers
• Add allow_merge parameter to aedit_entity
• Support merging when renaming to existing name
• Improve code reusability and modularity
• Maintain backward compatibility

* Add allow_merge parameter to entity update API endpoint

* feat: Improve entity merge and edit UX

- **API:** The `graph/entity/edit` endpoint now returns a detailed `operation_summary` for better client-side handling of update, rename, and merge outcomes.
- **Web UI:** Added an "auto-merge on rename" option. The UI now gracefully handles merge success, partial failures (update OK, merge fail), and other errors with specific user feedback.

* Fix entity update logic to handle renaming operations

- Add is_renaming condition check
- Ensure updates when entity renamed

* Refactor merge dialog and improve search history sync

- Extract MergeDialog to separate component
- Update search history on entity rename
- Add dropdown refresh trigger mechanism
- Sync query label with entity changes
- Force graph re-render after updates

* Add offline Swagger UI support with custom static file serving

- Disable default docs URL
- Add custom /docs endpoint
- Mount static Swagger UI files
- Include OAuth2 redirect handler
- Support offline documentation access

* Update redis requirement from <7.0.0,>=5.0.0 to >=5.0.0,<8.0.0

Updates the requirements on [redis](https://github.com/redis/redis-py) to permit the latest version.
- [Release notes](https://github.com/redis/redis-py/releases)
- [Changelog](https://github.com/redis/redis-py/blob/master/CHANGES)
- [Commits](redis/redis-py@v5.0.0...v7.0.1)

---
updated-dependencies:
- dependency-name: redis
  dependency-version: 7.0.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Fix entity and relationship deletion when no chunk references remain

* Move relationship ID sorting to before vector DB operations

• Remove verbose entity rebuild logging
• Sort IDs before vector DB updates
• Keep graph storage with original order

* Fix Entity Source IDs Tracking Problem

- Handle existing node updates properly in edge merging stage
- Fix source_ids merging logic
- Reorder entity deletion and optimize node operations
- Delete relationships before entities
- Add edge existence debugging logs

* Bump API version to 0247

* Fix z-index layering for GraphViewer UI panels

* Fix swagger docs page problem in dev mode

- Add /static to VITE_API_ENDPOINTS
- Update proxy rewrite rules
- Include static file serving
- Sync sample env file

* Remove enable_logging parameter from get_data_init_lock call in MilvusVectorDBStorage

* Fix worker process cleanup to prevent shared resource conflicts

• Add worker_exit hook in gunicorn config
• Add shutdown_manager parameter in finalize_share_data of share_storage
• Prevent Manager shutdown in workers
• Remove custom signal handlers

* Fix cleanup coordination between Gunicorn and UvicornWorker lifecycles

• Document UvicornWorker hook limitations
• Add GUNICORN_CMD_ARGS cleanup guard
• Prevent double cleanup in workers

* Replace GUNICORN_CMD_ARGS with custom LIGHTRAG_GUNICORN_MODE flag

• Use custom env var for mode detection
• Improve Gunicorn mode reliability

* Remove worker_exit hook and improve cleanup logging

• Remove unreliable worker_exit function
• Add debug logs for cleanup modes
• Move DEBUG_LOCKS to top of file

* Refactor service deployment to use direct process execution

- Remove bash wrapper script
- Update systemd service configuration
- Improve process management for gunicorn
- Simplify shared storage cleanup logic
- Update documentation for deployment

* Bump core version to 1.4.9.6 and API version to 0248

* Restore query generation example and fix README path reference

• Fix path from example/ to examples/
• Add generate_query.py implementation

* Refactor systemd service config to use environment variables

• Add LIGHTRAG_HOME environment variable
• Use .venv instead of venv directory

* docs: add frontend build steps to server installation guide

* Include static files in package distribution

- Add static dir to MANIFEST.in
- Update package data config
- Ensure static assets are bundled
- Fix missing static file issue

* Update uv.lock

* Refactor .gitignore

* Add Qdrant legacy collection migration with workspace support

- Add QdrantMigrationError exception
- Implement automatic data migration
- Support workspace-based partitioning
- Add migration verification logic
- Update collection naming scheme

* Bump core version to 1.4.9.7 and API version to 0249

* Remove redundant await call in file extraction pipeline

* Fix graph value handling for entity_id updates

• Use finalValue for entity_id changes
• Keep original value for other props
• Fix property update logic

* feat: Update node color and legent after entity_type changed

- Move color constants to utils module
- Extract resolveNodeColor function
- Update node colors on type changes
- Simplify hook color logic

* Optimize property edit dialog to use trimmed value consistently

* Add pycryptodome dependency for PDF encryption support

* Fix edge cleanup when deleting entities to prevent orphaned relationships

- Track edges to delete in set
- Clean VDB before node deletion
- Remove from relation chunks storage
- Prevent orphaned relationship data

* Add auto-refresh of popular labels when pipeline completes

• Monitor pipeline busy->idle transitions
• Reload labels on dropdown open if needed
• Add onBeforeOpen callback to AsyncSelect
• Clear refresh flags after processing
• Improve label sync with backend state

* Bump core version to 1.4.9.8 and API version to 0250

* Add frontend rebuild warning indicator to version display

- Return bool from check_frontend_build()
- Add ⚠️ symbol to outdated versions
- Show tooltip with rebuild message
- Add translations for warning text
- Fix tailwind config filename typo

* Add data/ directory to .gitignore

* Reduce logging verbosity in entity merge relation processing

* Improve entity merge logging by removing redundant message and fixing typo

* Rename function and variables for clarity in context building

- Rename _build_llm_context to _build_context_str
- Change text_units_context to chunks_context
- Move string building before early return
- Update log messages and comments
- Consistent variable naming throughout

* Remove redundant shutdown message from gunicorn

* Remove redundant separator lines in gunicorn shutdown handler

* Add PDF decryption support for password-protected files

• Add PDF_DECRYPT_PASSWORD env variable
• Check encryption status before reading
• Handle decrypt errors gracefully
• Log detailed error messages
• Support both encrypted/plain PDFs

* feat: add RAGAS evaluation framework for RAG quality assessment

This contribution adds a comprehensive evaluation system using the RAGAS
framework to assess LightRAG's retrieval and generation quality.

Features:
- RAGEvaluator class with four key metrics:
  * Faithfulness: Answer accuracy vs context
  * Answer Relevance: Query-response alignment
  * Context Recall: Retrieval completeness
  * Context Precision: Retrieved context quality
- HTTP API integration for live system testing
- JSON and CSV report generation
- Configurable test datasets
- Complete documentation with examples
- Sample test dataset included

Changes:
- Added lightrag/evaluation/eval_rag_quality.py (RAGAS evaluator implementation)
- Added lightrag/evaluation/README.md (comprehensive documentation)
- Added lightrag/evaluation/__init__.py (package initialization)
- Updated pyproject.toml with optional 'evaluation' dependencies
- Updated .gitignore to exclude evaluation results directory

Installation:
pip install lightrag-hku[evaluation]

Dependencies:
- ragas>=0.3.7
- datasets>=4.3.0
- httpx>=0.28.1
- pytest>=8.4.2
- pytest-asyncio>=1.2.0

* feat: add optional Langfuse observability integration

This contribution adds optional Langfuse support for LLM observability and tracing.
Langfuse provides a drop-in replacement for the OpenAI client that automatically
tracks all LLM interactions without requiring code changes.

Features:
- Optional Langfuse integration with graceful fallback
- Automatic LLM request/response tracing
- Token usage tracking
- Latency metrics
- Error tracking
- Zero code changes required for existing functionality

Implementation:
- Modified lightrag/llm/openai.py to conditionally use Langfuse's AsyncOpenAI
- Falls back to standard OpenAI client if Langfuse is not installed
- Logs observability status on import

Configuration:
To enable Langfuse tracing, install the observability extras and set environment variables:

```bash
pip install lightrag-hku[observability]

export LANGFUSE_PUBLIC_KEY="your_public_key"
export LANGFUSE_SECRET_KEY="your_secret_key"
export LANGFUSE_HOST="https://cloud.langfuse.com"  # or your self-hosted instance
```

If Langfuse is not installed or environment variables are not set, LightRAG
will use the standard OpenAI client without any functionality changes.

Changes:
- Modified lightrag/llm/openai.py (added optional Langfuse import)
- Updated pyproject.toml with optional 'observability' dependencies

Dependencies (optional):
- langfuse>=3.8.1

* docs: add generic test_dataset.json for evaluation examples
Test cases with generic examples about:
- LightRAG framework features and capabilities
- RAG system architecture and components
- Vector database support (ChromaDB, Neo4j, Milvus, etc.)
- LLM provider integrations (OpenAI, Anthropic, Ollama, etc.)
- RAG evaluation metrics explanation
- Deployment options (Docker, FastAPI, direct integration)
- Knowledge graph-based retrieval concepts

Changes:
- Added generic test_dataset.json with 8 LightRAG-focused test cases
- File added with git add -f to override test_* pattern

This provides realistic, reusable examples for users testing their
LightRAG deployments and helps demonstrate the evaluation framework.

* fix: Apply ruff formatting and rename test_dataset to sample_dataset

**Lint Fixes (ruff)**:
- Sort imports alphabetically (I001)
- Add blank line after import traceback (E302)
- Add trailing comma to dict literals (COM812)
- Reformat writer.writerow for readability (E501)

**Rename test_dataset.json → sample_dataset.json**:
- Avoids .gitignore pattern conflict (test_* is ignored)
- More descriptive name - it's a sample/template, not actual test data
- Updated all references in eval_rag_quality.py and README.md

Resolves lint-and-format CI check failure.
Addresses reviewer feedback about test dataset naming.

* fixed ruff format of csv path

* fix: Use actual retrieved contexts for RAGAS evaluation

**Critical Fix: Contexts vs Ground Truth**
- RAGAS metrics now evaluate actual retrieval performance
- Previously: Used ground_truth as contexts (always perfect scores)
- Now: Uses retrieved documents from LightRAG API (real evaluation)

**Changes to generate_rag_response (lines 100-156)**:
- Remove unused 'context' parameter
- Change return type: Dict[str, str] → Dict[str, Any]
- Extract contexts as list of strings from references[].text
- Return 'contexts' key instead of 'context' (JSON dump)
- Add response.raise_for_status() for better error handling
- Add httpx.HTTPStatusError exception handler

**Changes to evaluate_responses (lines 180-191)**:
- Line 183: Extract retrieved_contexts from rag_response
- Line 190: Use [retrieved_contexts] instead of [[ground_truth]]
- Now correctly evaluates: retrieval quality, not ground_truth quality

**Impact on RAGAS Metrics**:
- Context Precision: Now ranks actual retrieved docs by relevance
- Context Recall: Compares ground_truth against actual retrieval
- Faithfulness: Verifies answer based on actual retrieved contexts
- Answer Relevance: Unchanged (question-answer relevance)

Fixes incorrect evaluation methodology. Based on RAGAS documentation:
- contexts = retrieved documents from RAG system
- ground_truth = reference answer for context_recall metric

References:
- https://docs.ragas.io/en/stable/concepts/components/eval_dataset/
- https://docs.ragas.io/en/stable/concepts/metrics/

* Optimize RAGAS evaluation with parallel execution and chunk content enrichment

Added efficient RAG evaluation system with optimized API calls and comprehensive benchmarking.

Key Features:
- Single API call per evaluation (2x faster than before)
- Parallel evaluation based on MAX_ASYNC environment variable
- Chunk content enrichment in /query endpoint responses
- Comprehensive benchmark statistics (moyennes)
- NaN-safe metric calculations

API Changes:
- Added include_chunk_content parameter to QueryRequest (backward compatible)
- /query endpoint enriches references with actual chunk content when requested
- No breaking changes - default behavior unchanged

Evaluation Improvements:
- Parallel execution using asyncio.Semaphore (respects MAX_ASYNC)
- Shared HTTP client with connection pooling
- Proper timeout handling (3min connect, 5min read)
- Debug output for context retrieval verification
- Benchmark statistics with averages, min/max scores

Results:
- Average RAGAS Score: 0.9772
- Perfect Faithfulness: 1.0000
- Perfect Context Recall: 1.0000
- Perfect Context Precision: 1.0000
- Excellent Answer Relevance: 0.9087
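
A sketch of the MAX_ASYNC-bounded parallel evaluation with a shared HTTP client, under the timeouts stated above (3 min connect, 5 min read); the helper name `evaluate_one` and the exact request payload are illustrative:

```python
import asyncio
import os

import httpx


async def evaluate_all(test_cases: list[dict]) -> list[dict]:
    """Run evaluation queries in parallel, bounded by the MAX_ASYNC environment variable."""
    max_async = int(os.getenv("MAX_ASYNC", "4"))
    semaphore = asyncio.Semaphore(max_async)
    timeout = httpx.Timeout(connect=180.0, read=300.0, write=60.0, pool=60.0)

    async with httpx.AsyncClient(base_url="http://localhost:9621", timeout=timeout) as client:

        async def evaluate_one(case: dict) -> dict:
            async with semaphore:  # never more than MAX_ASYNC requests in flight
                resp = await client.post("/query", json={"query": case["question"]})
                resp.raise_for_status()
                return resp.json()

        return await asyncio.gather(*(evaluate_one(case) for case in test_cases))
```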

* docs: Add documentation and examples for include_chunk_content parameter

Added comprehensive documentation for the new include_chunk_content parameter
that enables retrieval of actual chunk text content in API responses.

Documentation Updates:
- Added "Include Chunk Content in References" section to API README
- Explained use cases: RAG evaluation, debugging, citations, transparency
- Provided JSON request/response examples
- Clarified parameter interaction with include_references

OpenAPI/Swagger Examples:
- Added "Response with chunk content" example to /query endpoint
- Shows complete reference structure with content field
- Demonstrates realistic chunk text content

This makes the feature discoverable through:
1. API documentation (README.md)
2. Interactive Swagger UI (http://localhost:9621/docs)
3. Code examples for developers
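
A hypothetical client-side request showing the include_chunk_content flag alongside include_references; payload fields beyond these two flags may differ from the actual QueryRequest schema:

```python
import httpx

payload = {
    "query": "How does LightRAG build its knowledge graph?",
    "include_references": True,     # return reference metadata
    "include_chunk_content": True,  # also return the actual chunk text
}

resp = httpx.post("http://localhost:9621/query", json=payload, timeout=60.0)
resp.raise_for_status()
for ref in resp.json().get("references", []):
    print(ref.get("content"))  # chunk text (a list of strings after the later API change)
```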

* Update lightrag/evaluation/eval_rag_quality.py for language fixes

Co-authored-by: Copilot <[email protected]>

* Use logger in RAG evaluation and optimize reference content joins

* Use OpenAI for evaluation

* Update env.example with host/endpoint clarifications for LLM/embedding

* fix(api): Change content field from string to list in query responses

BREAKING CHANGE: The `content` field in query response references is now
an array of strings instead of a concatenated string. This preserves
individual chunk boundaries when a single file has multiple chunks.

Changes:
- Update QueryResponse Pydantic model to accept List[str] for content
- Modify query_text endpoint to return content as list (query_routes.py:425)
- Modify query_text_stream endpoint to support chunk content enrichment
- Update OpenAPI schema and examples to reflect array structure
- Update API README with breaking change notice and migration guide
- Fix RAGAS evaluation to flatten chunk content lists

* fix(api): change content field to list in query responses

BREAKING CHANGE: content field is now List[str] instead of str

- Add ReferenceItem Pydantic model for type safety
- Update /query and /query/stream to return content as list
- Update OpenAPI schema and examples
- Add migration guide to API README
- Fix RAGAS evaluation to handle list format

Addresses PR HKUDS#2297 feedback. Tested with RAGAS: 97.37% score.
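
A short sketch of how a client can absorb the breaking change (content: str → List[str]); the helper name is illustrative, and joining restores the old concatenated behaviour where a single string is still needed:

```python
def normalize_content(content) -> list[str]:
    """Accept both the old string form and the new list-of-chunks form."""
    if isinstance(content, str):
        return [content]
    return list(content or [])


# Callers that expected a single string can simply join the chunks:
# full_text = "\n\n".join(normalize_content(ref["content"]))
```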

* refactor: reorder Langfuse import logic for improved clarity

Moved logger import before Langfuse block to fix NameError.

* Add BuildKit cache mounts to optimize Docker build performance

- Enable BuildKit syntax directive
- Cache UV and Bun package downloads
- Update docs for cache optimization
- Improve rebuild efficiency

* fix(evaluation): Move import-time validation to runtime and improve documentation

Changes:
- Move sys.exit() calls from module level to __init__() method
- Raise proper exceptions (ImportError, ValueError, EnvironmentError) instead of sys.exit()
- Add lazy import for RAGEvaluator in __init__.py using __getattr__
- Update README to clarify sample_dataset.json contains generic test data (not personal)
- Fix README to reflect actual output format (JSON + CSV, not HTML)
- Improve documentation for custom test case creation

Addresses code review feedback about import-time validation and module exports.
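
A sketch of the lazy-export pattern described above for lightrag/evaluation/__init__.py, using PEP 562 module-level __getattr__ so the heavy ragas import is deferred until first use:

```python
__all__ = ["RAGEvaluator"]


def __getattr__(name: str):
    if name == "RAGEvaluator":
        from .eval_rag_quality import RAGEvaluator  # imported only on first access
        return RAGEvaluator
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```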

* Improve Langfuse integration and stream response cleanup handling

• Check env vars before enabling Langfuse
• Move imports after env check logic
• Handle wrapper client aclose() issues
• Add debug logs for cleanup failures
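
A sketch of the env-gated Langfuse enablement described above; the standard LANGFUSE_* variable names and the drop-in import path are assumptions, and the import stays behind the check so unconfigured deployments never touch the SDK:

```python
import os

# Only enable Langfuse when both keys are present in the environment
langfuse_enabled = bool(os.getenv("LANGFUSE_PUBLIC_KEY") and os.getenv("LANGFUSE_SECRET_KEY"))

if langfuse_enabled:
    from langfuse.openai import openai  # traced drop-in wrapper, imported only when configured
else:
    import openai
```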

* Adds initial LightRAG app integration with schema and processors

Introduces the LightRAG Retrieval-Augmented Generation framework as an Apolo app, including input/output schemas, types, and processors.
Adds Helm chart value processing, environment and persistence configurations, and output service discovery for deployment.
Includes scripts for generating type schemas and testing support, along with CI and linting setup tailored for the new app.
Provides a documentation loader script to ingest markdown files into LightRAG with flexible referencing modes.

Relates to MLO-469

* Removes deprecated Actionlint problem matcher

Deletes the obsolete Actionlint problem matcher configuration to clean up workflow files and reduce maintenance overhead.

Relates to MLO-469

* Enhance documentation loader and build scripts

Refactors the documentation loading script for improved readability, type hinting, and error handling. Updates CLI argument parsing and output formatting for clarity.

Replaces a simple makefile target with a more robust schema generation makefile including clean and test targets, and adds a placeholder test target to the Helm build system for consistency.

Removes obsolete lint configuration for streamlined tooling setup.

These changes improve maintainability and usability of schema generation and documentation loading workflows.

Relates to MLO-469

* Cleans up documentation and deployment scripts for consistency

Removes trailing whitespace and fixes minor formatting issues in Kubernetes deployment docs, storage report, and Helm chart files.

Standardizes indentation and spacing in Docker Compose and deployment shell scripts to improve readability and maintainability.

These edits improve documentation clarity and make deployment scripts more robust without altering functionality.

Relates to MLO-469

* feat(evaluation): Add sample documents for reproducible RAGAS testing

Add 5 markdown documents that users can index to reproduce evaluation results.

Changes:
- Add sample_documents/ folder with 5 markdown files covering LightRAG features
- Update sample_dataset.json with 3 improved, specific test questions
- Shorten and correct evaluation README (removed outdated info about mock responses)
- Add sample_documents reference with expected ~95% RAGAS score

Test Results with sample documents:
- Average RAGAS Score: 95.28%
- Faithfulness: 100%, Answer Relevance: 96.67%
- Context Recall: 88.89%, Context Precision: 95.56%

* Adds make targets to build and publish hooks Docker images

Introduces new make targets to build and push pre-commit hooks Docker images, enabling streamlined image management alongside Helm chart packaging.

Enhances the help output with details on the new hooks-related targets and supports versioned image tagging.

Relates to MLO-469

* chore: trigger CI re-run

* Adds Makefile targets for dependency management, linting, and testing

Introduces install and setup targets to streamline project dependency installation using Poetry.
Adds lint target to run pre-commit checks automatically.
Adds test-unit target for running the unit test suite with pytest.

These enhancements improve developer experience by standardizing common tasks within the Makefile.

Relates to MLO-469

* Adds unit tests for LightRAG input/output processors and package exports

Introduces comprehensive async unit tests to verify input processing merges extra values correctly and output generation returns expected URLs and ports.

Adds a basic test to confirm package exports remain intact. Cleans up the .gitignore by removing redundant test file ignores to allow test discovery.

Improves code quality and test coverage for the LightRAG app integration.

Relates to MLO-469

* chore: trigger CI re-run 2

* Updates CI workflow to support main branch

Expands CI triggers and release conditions to include both main and master branches.
Ensures workflows run consistently as the repository transitions or supports main as the default branch.

Relates to MLO-469

* Refactor build and packaging scripts, add Helm Makefile, and configure actionlint matcher

Removes legacy type schema generation Makefile and updates pre-commit hook to use new Makefile command.

Introduces a comprehensive Makefile to manage Helm chart packaging, publishing, testing, linting, and Docker image workflows, streamlining Apolo project automation and aligning with upstream LightRAG packaging conventions.

Adds GitHub Action problem matcher configuration for actionlint to improve workflow diagnostics.

Enhances maintainability and developer experience by centralizing build and deployment processes.

Relates to MLO-469

* Updates problem matcher owners to use differentiated identifiers

Renames problem matcher owners to use distinct names for detailed and brief variants.
This clarifies matcher identification and avoids conflating different output patterns during linting.

Relates to MLO-469

* Adds comprehensive tests for inputs and outputs processors

Introduces extensive unit tests covering various LLM and embedding provider configurations for input processing, including compatibility and error handling for missing models.

Adds tests for output processing to handle cases when service or ingress hosts are unavailable, ensuring robustness in output URL generation.

Improves overall test coverage and reliability for the LightRAG app inputs and outputs components.

Relates to MLO-469

* Updates container image references for LightRAG app

Standardizes the container image name to use consistent capitalization and naming across deployment configs and build scripts.

Improves maintainability by centralizing the image target in Makefile variables, ensuring build and push commands use the updated image reference.

Relates to MLO-469

* Corrects container image name casing for consistency

Updates all references of the container image to use lowercase naming, ensuring consistency and avoiding potential deployment issues caused by case sensitivity.

Relates to MLO-469

* Updates image tagging to reflect current branch and pushes latest for main

Replaces the default image tag with a dynamic tag derived from the current branch or commit SHA, ensuring image tags correspond to the actual code state.

Adds logic to push the 'latest' tag alongside the branch tag when on the main branch, keeping the latest image readily accessible.

Improves tagging consistency and clarity in CI/CD workflows.

Relates to MLO-469

* Improves branch tag detection in image tagging

Enhances branch name extraction by prioritizing the pull request head ref and cleaning ref prefixes.

Improves tag normalization for consistent Docker image tagging across CI workflows and local runs.

Relates to MLO-469

* Refines CI workflow for Docker image building and pushing

Simplifies image build and push steps by consolidating redundant jobs.
Separates pushing logic based on ref type (branch vs tag) for clearer control and accuracy.
Removes metadata extraction step to streamline the workflow and avoid unnecessary complexity.
Improves environment variable handling for image tagging during pushes.
Enhances overall maintainability and clarity of the CI pipeline.

Relates to MLO-469

* Adds LightRAG-specific OpenAI compatible provider support with improved URL normalization

Introduces new LightRAG-specific OpenAI compatible chat and embeddings provider types to better support model and dimension overrides.

Refactors URL handling to normalize and unify endpoint construction across providers, ensuring robust full URL formation from partial configurations.

Updates defaults for several models and embedding dimensions to current, more accurate values. Enhances schema definitions accordingly to support these improvements.

Extends test coverage to verify behavior with/without explicit models and Hugging Face configurations, preventing previous errors and improving robustness.

Relates to MLO-469

* Removes Ollama provider support from LightRAG app

Eliminates Ollama LLM and embedding provider configurations, schema definitions, and related processing logic.
Cleans up imports, type declarations, and test cases to reflect removal.
This simplifies the codebase by dropping self-hosted Ollama server support.

Relates to MLO-469

* Update .env loading and add API authentication to RAG evaluator

• Load .env from current directory
• Support LIGHTRAG_API_KEY auth header
• Override=False for env precedence
• Add Bearer token to API requests
• Enable per-instance .env configs
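
A sketch of the .env loading and auth wiring listed above; LIGHTRAG_API_KEY and the Bearer scheme come from this change, while the python-dotenv call and base URL are illustrative:

```python
import os

import httpx
from dotenv import load_dotenv

load_dotenv(".env", override=False)  # existing environment variables take precedence

headers = {}
api_key = os.getenv("LIGHTRAG_API_KEY")
if api_key:
    headers["Authorization"] = f"Bearer {api_key}"

client = httpx.AsyncClient(base_url="http://localhost:9621", headers=headers)
```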

* Add comprehensive configuration and compatibility fixes for RAGAS

- Fix RAGAS LLM wrapper compatibility
- Add concurrency control for rate limits
- Add eval env vars for model config
- Improve error handling and logging
- Update documentation with examples

* Update RAG evaluation metrics to instantiate metric classes instead of importing prebuilt instances

• Import metric classes not instances
• Instantiate metrics with () syntax
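
A sketch of the class-based metric setup described above; the exact class names follow the ragas documentation and may vary between ragas versions:

```python
from ragas.metrics import (
    AnswerRelevancy,
    ContextPrecision,
    ContextRecall,
    Faithfulness,
)

# Instantiate with () instead of importing the prebuilt singleton instances
metrics = [Faithfulness(), AnswerRelevancy(), ContextRecall(), ContextPrecision()]
```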

* Refactor LightRAG LLM and embedding provider configs

Replaces deprecated Anthropic and Gemini LLM providers with generalized OpenAI-compatible API classes to support broader deployments including vLLM and OpenRouter.
Updates embedding provider defaults and model handling to better align with hosted and self-hosted OpenAI-compatible services.
Simplifies input processing logic by unifying provider detection and configuration extraction.
Removes legacy schema definitions and adjusts tests accordingly for improved maintainability and extensibility of LLM and embedding integrations.

Relates to MLO-469

* Clean up RAG evaluator logging and remove excessive separator lines

• Remove excessive separator lines
• Add RAGAS concurrency comment
• Fix output buffer timing

* Refactor LLM and embedding configs to enforce strict types

Replaces deprecated OpenAI-like provider types with explicit OpenAI-compatible API and cloud provider models.
Enforces presence of required model fields for Hugging Face and embedding configurations, raising clear errors when missing.
Simplifies config extraction logic by rejecting unsupported config types.
Updates schemas and tests to reflect new types and stricter validations, improving consistency and reliability in LightRAG inputs processing.

Relates to MLO-469

* Update RAGAS evaluation to use gpt-4o-mini and improve compatibility

- Change default model to gpt-4o-mini
- Add deprecation warning suppression
- Update docs and comments for LightRAG
- Improve output formatting and timing

* Refines embedding provider naming and schema clarity

Renames official OpenAI embedding provider to clearly indicate cloud-based service
Removes redundant model field from embedding provider schemas to simplify configuration
Updates schema titles and descriptions to better represent supported embedding APIs, emphasizing self-hosted compatibility
Improves type definitions and validation logic for OpenAI-compatible embeddings to enforce presence of Hugging Face models
Enhances consistency and clarity across embedding provider usage and tests

Relates to MLO-469

* Refactor URL normalization and update OpenAI provider models

Improves URL construction by making port optional and omitting default ports for HTTPS/HTTP in normalized URLs. Adds support for an alternative base path attribute and cleans up schema defaults to better align with expected host and path values.

Simplifies hosted OpenAI provider models by removing explicit port fields and returning none for the port property, reflecting that ports are no longer required or forced. Adjusts tests to expect URLs without default port segments, ensuring consistency with normalization changes.

These changes clarify and streamline endpoint configuration handling and improve consistency across API definitions.

Relates to MLO-469

* Enhance LLM config handling and add port support for OpenAI providers

Adds optional port fields and base path customization to OpenAI-compatible and cloud provider schemas and types to improve endpoint flexibility.

Refines LLM config extraction logic to enforce Hugging Face model presence where required and support cloud provider configurations consistently.

Extends unit tests to cover new port handling and validation scenarios, ensuring robust processing of LLM input configurations.

Relates to MLO-469

* Refactors LLM config extraction to prioritize cloud provider type

Reorders conditions to handle cloud provider configurations before compatibility wrappers.
Removes redundant code to streamline extraction logic and improve clarity.

Relates to MLO-469

* Refactor embedding config extraction for cleaner type handling

Consolidates handling of a specific embedding provider by moving its type check to the top of the extraction method.

This prevents redundant checks and simplifies the code flow, making the embedding configuration extraction more straightforward and maintainable.

Relates to MLO-469

* Enhances input validation for LLM and embedding providers

Adds pre-validation logic to automatically select the appropriate LLM and embedding provider models based on input data structure, supporting both cloud and OpenAI-compatible configurations.

Replaces debug print with structured logging for LLM config processing.

Includes comprehensive unit tests to verify correct provider selection from input dicts.

Improves robustness and clarity in handling diverse configuration formats, facilitating flexible provider integration.

Relates to MLO-469-2

* Update dependencies and fix logging message typo

Replaces the existing dependency lock file with an updated one, refreshing package versions and dependencies to maintain compatibility and security.

Corrects a logging message typo by adding a missing space for improved readability.

Supports future development by ensuring an up-to-date dependency environment and clearer logging output.

* Adds Apolo extras and dependencies integration

Enables Apolo as an optional extras group with corresponding dependencies in the package manager configuration.

Updates setup commands to install Apolo extras by default in development environments.

Includes Apolo-related packages and their dependencies, ensuring seamless integration of Apolo platform features.

Adjusts package markers to include Apolo for appropriate conditional installs and compatibility.

Configures build system to recognize and include Apolo application source code.

Overall, prepares the project to support Apolo app integration with streamlined dependency management and installation.

* Removes deprecated package mode configuration

Cleans up the project configuration by eliminating an unused package mode setting.
Ensures the configuration stays current with the latest packaging standards.

* Updates schema meta-type to integration for OpenAI-compatible APIs

Changes schema metadata for OpenAI-compatible embeddings and API providers from "inline" to "integration" to better reflect their use case as external service integrations.
Improves consistency and clarity in type definitions and JSON schema annotations.

Relates to MLO-469-2

* Refactor OpenAI-compatible API types and update schemas for clarity

Renames and reorganizes OpenAI-compatible chat and embeddings API classes for clearer distinction between chat and embeddings providers.
Updates JSON schemas to align property defaults and constraints with the new types, reflecting accurate protocol, ports, endpoints, and required fields.
Adjusts input processing and tests to use the updated types, ensuring consistency and proper validation of self-hosted and hosted configurations.

Improves maintainability and accuracy of OpenAI-compatible API integration.

Relates to MLO-469-2

* Removes deprecated input field validators

Eliminates outdated pre-validation logic for input configurations to simplify the input model and reduce complexity.
This cleanup helps maintain clarity by relying on updated validation mechanisms without custom pre-processing steps.

Relates to MLO-469-2

* Sets default and validation for embedding dimensions in schema and types

Adds a default value of 1024 and enforces a positive integer constraint for embedding vector dimensions in both schema and type definitions.
Removes the mandatory requirement to specify dimensions, improving usability and preventing validation errors when the field is omitted.
Clarifies the description to indicate the default applies if no explicit value is provided.

Relates to MLO-469-2
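
An illustrative sketch of the dimension default and constraint described above, assuming the schema uses Pydantic; the class and field names are assumptions:

```python
from pydantic import BaseModel, Field


class EmbeddingConfig(BaseModel):
    # Defaults to 1024 when omitted; must be a positive integer
    dimension: int = Field(default=1024, gt=0, description="Embedding vector dimensions")
```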

* Allows HTTP or HTTPS protocol for embeddings endpoint

Expands protocol options to include both HTTP and HTTPS for the embeddings endpoint, improving flexibility in network configuration.

Clarifies the port description to refer generically to the network port instead of HTTPS-only.

Relates to MLO-469-2

* Defaults missing API keys to "no-auth" in inputs processor

Ensures that empty or missing API keys are normalized to "no-auth" for LLM and embedding configurations.

This prevents passing None or empty strings downstream, simplifying authentication handling and improving consistency in environment variable setup.

Includes corresponding test updates to validate this behavior.

Relates to MLO-469-2
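
A tiny sketch of the "no-auth" normalization described above; the helper name is illustrative:

```python
def normalize_api_key(value: str | None) -> str:
    """Map missing or empty API keys to the sentinel expected downstream."""
    return value.strip() if value and value.strip() else "no-auth"
```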

* Updates container image tag to v1.4.9.7

Bumps the application image version to incorporate the latest upstream changes and improvements.

Ensures the deployment uses the updated release for enhanced features and bug fixes.

* Updates LightRAG app container image tag to v1.4.9.7

Bumps the container image tag to a newer version to ensure the app uses the latest features and fixes.

Supports ongoing upstream updates and improves deployment consistency.

* Updates app labels and bumps dependencies

Changes app labeling from a generic Kubernetes key to an explicit application name to improve service identification.

Also upgrades critical internal dependencies to their latest versions to ensure compatibility and access new features.

Relates to MLO-469-2

* Corrects label keys for LightRAG application outputs

Updates label keys from a generic Kubernetes naming convention to a simpler "application" key for LightRAG.

Ensures consistency in labels used for retrieving internal and external web URLs in both production code and tests, preventing potential mismatches and errors when querying services.

Relates to MLO-469-2

* Adds application label to Kubernetes resource templates

Includes a standard application label alongside common and selector labels to improve resource identification and filtering in Kubernetes manifests.

Enhances consistency and enables better organization for tooling that relies on this label.

* Updates container image tag to v1.4.9.7

Advances the deployed image version to incorporate latest changes and fixes from upstream.

Ensures the service runs with the most recent stable release for improved reliability.

---------

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Anush008 <[email protected]>
Co-authored-by: yangdx <[email protected]>
Co-authored-by: Won-Kyu Park <[email protected]>
Co-authored-by: DivinesLight <[email protected]>
Co-authored-by: haseebuchiha <[email protected]>
Co-authored-by: Daniel.y <[email protected]>
Co-authored-by: Lucky Verma <[email protected]>
Co-authored-by: Yasiru Rangana <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: xiaojunxiang <[email protected]>
Co-authored-by: Mobious <[email protected]>
Co-authored-by: Anush008 <[email protected]>
Co-authored-by: anouarbm <[email protected]>
Co-authored-by: ben moussa anouar <[email protected]>
Co-authored-by: Copilot <[email protected]>