📊 Built-in evaluation harness with recommended metrics and benchmark datasets #722

CharnaParkey · 2025-12-22T23:08:25Z

CharnaParkey
Dec 22, 2025
Maintainer

Today, you can hook up any evaluation harness to OpenRAG using OpenTelemetry, but we're considering shipping an out-of-the-box evaluation harness so you can start evaluating immediately.

What would be included:

1. Recommended Metrics - Curated set of metrics you should be tracking:

Answer quality (relevance, accuracy, faithfulness/groundedness)
Retrieval performance (precision, recall, MRR, NDCG)
Latency and response time
Context utilization
Cost per query
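
To make the retrieval metrics above concrete, here is a minimal sketch of how they are commonly computed for a single query with binary relevance labels. This is an illustration only, not OpenRAG's API: the function names and document IDs are hypothetical, and it assumes each retrieved ID appears at most once in the ranking.

```python
import math
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled with relevant documents."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: one query, four retrieved chunks, two relevant ones.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
scores = {
    "precision@3": precision_at_k(retrieved, relevant, 3),
    "recall@3": recall_at_k(retrieved, relevant, 3),
    "reciprocal_rank": reciprocal_rank(retrieved, relevant),
    "ndcg@3": ndcg_at_k(retrieved, relevant, 3),
}
```

MRR is the mean of `reciprocal_rank` over all queries in the evaluation set. Answer-quality metrics such as faithfulness usually need a judge model or human labels, so they are not sketched here.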

2. Benchmark Datasets - Standard datasets to measure against:

Industry-standard RAG benchmarks
Domain-specific test sets
Comparison baselines for common use cases

3. Synthetic Data Generation - Generate test data on demand:

Automatically create evaluation questions from your documents
Generate diverse query variations
Create edge cases and stress tests
Build regression test suites

How it would work:

1. Install OpenRAG and configure it with your preferred models; the evaluation harness is then ready to use
2. Run baseline evaluation on provided benchmark datasets
3. Upload your own documents and modify the underlying flows as needed
4. Generate synthetic test data from your own documents
5. Track metrics over time as you improve your pipeline
6. Compare against benchmarks and your own baselines
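
Steps 5 and 6 could boil down to something like the sketch below: compare the current run's metrics against a stored baseline and flag anything that got worse beyond a tolerance. Everything here is an assumption for illustration (the function name, the metric names, the 2% relative tolerance, and the choice of which metrics are "lower is better"); none of it reflects an existing OpenRAG interface.

```python
from typing import Dict, Iterable, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
    lower_is_better: Iterable[str] = ("latency_ms", "cost_per_query"),
) -> List[str]:
    """Return the names of metrics that worsened by more than `tolerance`.

    `tolerance` is relative to the baseline value. Metrics missing from
    `current` are skipped rather than treated as regressions.
    """
    lower = set(lower_is_better)
    regressions = []
    for name, base in baseline.items():
        if name not in current:
            continue
        value = current[name]
        if name in lower:
            worse = value > base * (1 + tolerance)
        else:
            worse = value < base * (1 - tolerance)
        if worse:
            regressions.append(name)
    return regressions


# Hypothetical numbers: a baseline run and a run after changing a flow.
baseline = {"recall_at_5": 0.80, "mrr": 0.60, "latency_ms": 900.0}
current = {"recall_at_5": 0.70, "mrr": 0.61, "latency_ms": 950.0}
regressed = find_regressions(baseline, current)
```

In a CI job, a non-empty `regressed` list could fail the build, which is one way to get the "catch regressions before you ship" behavior.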

Why this matters:

No setup required - Start evaluating immediately (vs. hooking up your own harness)
Best practices built-in - Know what to measure without being an expert
Apples-to-apples comparison - Benchmark against standard datasets
Test data generation - No need to create test sets manually
Catch regressions - Automated testing before you ship changes

Questions for the community:

Which metrics would be most valuable to have out of the box?
What benchmark datasets would help you most? (general, domain-specific, multilingual?)
For synthetic data generation: what kinds of test queries would you want generated?
How would you want to view/export results? (UI dashboard, CLI reports, CI/CD integration?)
Would you want this integrated with existing eval frameworks or standalone?

Vote with 👍 if you'd use this feature!

Tell us: How do you evaluate your RAG applications today? What makes it painful?

Replies: 0 comments
