FreshQA Evaluator

A custom NeMo Agent Toolkit (NAT) evaluator implementing the FreshEval Relaxed methodology for evaluating factual accuracy of model responses.

Overview

This evaluator implements the FreshEval Relaxed evaluation methodology from FreshLLMs. It evaluates model responses under relaxed criteria where hallucinations, outdated information, and ill-formed answers are allowed, as long as the primary answer is accurate.

Installation

# From the evaluation/freshqa directory
pip install -e .

Dataset Setup

The FreshQA dataset is not included in the repository. Download it before running evaluation:

1. Download the dataset from the FreshLLMs GitHub repository.
2. Place the dataset files in frontends/benchmarks/freshqa/data/:
- FreshQA_v112425.json (required)
- FreshQA_v112425.csv (optional, for reference)
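
Before running an evaluation, it can help to confirm the required file is where the configs expect it. The following is a minimal sketch, not part of the evaluator itself; the directory and file name come from the steps above, and the helper name `missing_dataset_files` is hypothetical.

```python
from pathlib import Path

# Directory and file name are taken from the setup steps above.
DATA_DIR = Path("frontends/benchmarks/freshqa/data")
REQUIRED_FILES = ["FreshQA_v112425.json"]


def missing_dataset_files(data_dir: Path = DATA_DIR) -> list[str]:
    """Return the required dataset files that are not present in data_dir."""
    return [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_dataset_files()
    if missing:
        print(f"Missing required dataset files in {DATA_DIR}: {missing}")
    else:
        print("FreshQA dataset files found.")
```

Run it from the repository root so the relative path resolves correctly.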

Prerequisites

Judge model and API key

The FreshQA evaluator uses an LLM judge. The default configs use OpenAI GPT-4o as the judge.

1. Choose a judge model (e.g. OpenAI GPT-4o or Gemini 2.5 Flash).
2. Obtain an API key and set it in deploy/.env (e.g. OPENAI_API_KEY=your_key).
3. To use a different judge, add that LLM under llms: and set eval.evaluators.freshqa.llm_name to its name.

Other API keys

Also set the following in deploy/.env: NVIDIA_API_KEY (used by the agent) and TAVILY_API_KEY (used for web search).
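
A quick way to catch a missing key before a long run is to scan deploy/.env for the variables named above. This is a rough sketch under simple assumptions: it only understands plain `KEY=value` lines (no `export` prefix, no multi-line values), it assumes the default OpenAI judge, and the function names are hypothetical.

```python
from pathlib import Path

# Keys named in this README; OPENAI_API_KEY applies to the default GPT-4o judge.
REQUIRED_KEYS = ["OPENAI_API_KEY", "NVIDIA_API_KEY", "TAVILY_API_KEY"]


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str, required: list[str] = REQUIRED_KEYS) -> list[str]:
    """Return required keys that are absent or set to an empty value."""
    values = parse_env_text(text)
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path("deploy/.env")
    text = env_path.read_text() if env_path.is_file() else ""
    print("Missing keys:", missing_keys(text))
```

If you use a different judge, replace OPENAI_API_KEY in `REQUIRED_KEYS` with the key that provider expects.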

Quick Start

# Shallow research only
dotenv -f deploy/.env run nat eval --config_file frontends/benchmarks/freshqa/configs/config_shallow_research_only.yml

# Full workflow (orchestration + research agents)
dotenv -f deploy/.env run nat eval --config_file frontends/benchmarks/freshqa/configs/config_full_workflow.yml

Results are written to frontends/benchmarks/freshqa/results, or to the output_dir set in the config.

Evaluation Methodology

The FreshEval Relaxed methodology:

Relaxed Criteria: Allows hallucinations, outdated information, and ill-formed answers as long as the primary answer is accurate.
Confident Answers Required: Credits responses only if they provide a confident and definitive answer, or the correct answer can be obviously inferred.
False Premise Handling: For false-premise questions, the response must explicitly point out the presence of a false premise to receive credit.
Name Accuracy: For answers involving names of entities (e.g., people), complete names or commonly recognized names are expected.
Numerical Precision: Approximate numbers are generally not accepted unless explicitly included in the ground-truth answers.

Output Metrics

The evaluator produces:

accuracy: Mean accuracy across all evaluated items (0-1)
total_correct: Number of correctly answered questions
total_evaluated: Total number of items evaluated
average_score: Accuracy as a percentage (0-100)
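
The relationship between these four metrics can be sketched as follows. This is an illustration of the arithmetic only, not the evaluator's actual code; the function name `summarize` and the choice to report 0.0 for an empty run are assumptions.

```python
def summarize(is_correct_flags: list[bool]) -> dict:
    """Aggregate per-item correctness flags into the summary metrics."""
    total_evaluated = len(is_correct_flags)
    total_correct = sum(is_correct_flags)
    # Assumption: an empty run reports 0.0 rather than raising an error.
    accuracy = total_correct / total_evaluated if total_evaluated else 0.0
    return {
        "accuracy": accuracy,               # fraction in the range 0-1
        "total_correct": total_correct,
        "total_evaluated": total_evaluated,
        "average_score": accuracy * 100,    # same value as a percentage
    }


if __name__ == "__main__":
    print(summarize([True, False, True, True]))
```

With three correct items out of four, accuracy is 0.75 and average_score is 75.0.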

Each item includes detailed reasoning with:

is_correct: Boolean indicating if the response was correct
rating: "TRUE" or "FALSE"
explanation: LLM's explanation for the rating
question, model_response, correct_answers: Context for the evaluation
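
As an illustration of how these per-item fields fit together, the sketch below builds one reasoning record from a judge rating. The field names come from the list above; the `ItemReasoning` class, the `build_item` helper, and the rule that an unrecognized rating counts as "FALSE" are assumptions rather than the evaluator's documented behavior.

```python
from dataclasses import dataclass, asdict


@dataclass
class ItemReasoning:
    question: str
    model_response: str
    correct_answers: list[str]
    rating: str        # "TRUE" or "FALSE"
    explanation: str
    is_correct: bool


def build_item(question: str, model_response: str, correct_answers: list[str],
               judge_rating: str, explanation: str) -> ItemReasoning:
    """Normalize the judge rating and derive is_correct from it."""
    rating = judge_rating.strip().upper()
    if rating not in {"TRUE", "FALSE"}:
        # Assumption: anything the judge did not clearly rate is treated as incorrect.
        rating = "FALSE"
    return ItemReasoning(
        question=question,
        model_response=model_response,
        correct_answers=correct_answers,
        rating=rating,
        explanation=explanation,
        is_correct=(rating == "TRUE"),
    )


if __name__ == "__main__":
    item = build_item(
        "Who founded Amazon?",
        "Amazon was founded by Jeff Bezos.",
        ["Jeff Bezos"],
        " true ",
        "The response names Jeff Bezos.",
    )
    print(asdict(item))
```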

Dataset Format

The evaluator expects a CSV file with the following columns:

question: The question to be answered
answer_0 through answer_9: Acceptable correct answers (can have multiple)
split: Optional filter column (e.g., "TEST", "DEV")
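
A minimal sketch of reading a file in this format with the standard library is shown below. It assumes the column names listed above, drops empty answer cells, and treats the split filter as case-insensitive; the `load_items` helper is hypothetical and not the evaluator's own loader.

```python
import csv
import io

# answer_0 through answer_9, as listed above.
ANSWER_COLUMNS = [f"answer_{i}" for i in range(10)]


def load_items(csv_file, split=None):
    """Read rows from an open CSV file, optionally keeping only one split."""
    items = []
    for row in csv.DictReader(csv_file):
        row_split = (row.get("split") or "").strip().upper()
        if split is not None and row_split != split.upper():
            continue
        # Keep only the answer columns that are present and non-empty.
        answers = [row[col].strip() for col in ANSWER_COLUMNS
                   if (row.get(col) or "").strip()]
        items.append({"question": row["question"].strip(),
                      "correct_answers": answers})
    return items


if __name__ == "__main__":
    sample = io.StringIO(
        "question,answer_0,answer_1,split\n"
        "Who founded Amazon?,Jeff Bezos,,TEST\n"
        "Who painted The Starry Night?,Vincent van Gogh,,DEV\n"
    )
    print(load_items(sample, split="TEST"))
```

For a real file, pass an open handle instead, for example `open(path, newline="", encoding="utf-8")`.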

References

FreshQA Dataset Intro

The FreshQA benchmark dataset is designed to evaluate how well language models handle questions requiring up-to-date world knowledge.

Dataset Overview

FreshQA categorizes questions along three key dimensions:

| Dimension | Values | Description |
| --- | --- | --- |
| Fact Type | `never-changing`, `slow-changing`, `fast-changing` | How frequently the answer changes over time |
| Num Hops | `one-hop`, `multi-hop` | Whether the question requires single or chained reasoning |
| False Premise | `True`, `False` | Whether the question contains an incorrect assumption |
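
These dimensions are useful for breaking accuracy down by category. The sketch below groups evaluated items by one dimension and computes accuracy per value. The keys `fact_type`, `num_hops`, and `false_premise` are assumed field names for the three dimensions, and `accuracy_by` is a hypothetical helper; the evaluator itself is not documented here as producing this breakdown.

```python
from collections import defaultdict


def accuracy_by(items: list[dict], dimension: str) -> dict:
    """Return accuracy per distinct value of the given dimension."""
    totals = defaultdict(lambda: [0, 0])  # value -> [correct, evaluated]
    for item in items:
        bucket = totals[item[dimension]]
        bucket[0] += int(item["is_correct"])
        bucket[1] += 1
    return {value: correct / evaluated
            for value, (correct, evaluated) in totals.items()}


if __name__ == "__main__":
    # Toy records for illustration only; not real evaluation results.
    evaluated = [
        {"fact_type": "never-changing", "is_correct": True},
        {"fact_type": "never-changing", "is_correct": True},
        {"fact_type": "fast-changing", "is_correct": False},
        {"fact_type": "fast-changing", "is_correct": True},
    ]
    print(accuracy_by(evaluated, "fact_type"))
```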

Never-Changing Facts

These questions have answers that remain constant over time.

One-Hop Examples

Q: What is the largest mammal in the world? A: Blue whale

Q: Who founded Amazon? A: Jeff Bezos

Q: What is the capital of the commonwealth of Massachusetts? A: Boston

Q: On what date did the Berlin Wall fall? A: November 9, 1989

Q: Who painted The Starry Night? A: Vincent van Gogh

Multi-Hop Examples

Q: What's the capital of the largest state in America? A: Juneau (Alaska → capital is Juneau)

Q: Which member of The Beatles was born first? A: Ringo Starr

Q: Where was the primary designer of AlexNet born? A: Ukraine (Alex Krizhevsky → born in Ukraine)

Slow-Changing Facts

These questions have answers that change infrequently, perhaps yearly or less often.

One-Hop Examples

Q: Who is the CEO of Google? A: Sundar Pichai

Q: How many grand slams does Rafael Nadal have? A: 22

Q: What is the best-selling video game franchise of all time? A: Mario

Q: What is the top-ranked university in the US according to the US News Ranking? A: Princeton University

Q: Who is the current German chancellor? A: Friedrich Merz

Multi-Hop Examples

Q: Which country has President Joe Biden visited the most during his presidency so far? A: United Kingdom

Q: Who has the most liked post on Instagram? A: Lionel Messi

Fast-Changing Facts

These questions have answers that can change frequently, sometimes daily or weekly.

One-Hop Examples

Q: Who is the richest man on earth? A: Elon Musk (as of dataset creation)

Q: Who is the current No. 1 on the Billboard Artist 100? A: Taylor Swift (as of dataset creation)

Q: What is the latest highest-grossing movie of the week at the Box office? A: (Changes weekly)

Q: What is Geoff Hinton's h-index? A: 192 (as of dataset creation)

Multi-Hop Examples

Q: How old is the richest person in the world? A: 54 years old (Elon Musk's age at dataset creation)

Q: In what country was the most recent Time Person of the Year born? A: USA (Donald Trump → born in USA)

False Premise Questions

These questions contain incorrect assumptions that the model must identify and address.

Factual Corrections

Q: What is the name of Leonardo DiCaprio's third child? A: Leonardo DiCaprio does not have any children.

Q: When did the UK adopt the Euro? A: The United Kingdom has never adopted the Euro.

Q: How many home runs did Real Madrid score? A: Real Madrid is a professional soccer club and thus cannot score home runs. In soccer, players score goals, not home runs.

Temporal Corrections

Q: When was the iPhone 9 released? A: Apple has never released an iPhone 9 nor does it plan to.

Q: When did the first human land on Mars? A: No humans have been to Mars yet.

Logical Corrections

Q: By how much is 3 bigger than 4? A: 3 is smaller than 4 by 1.

Q: Which antibiotics are most effective against the flu? A: Antibiotics are only effective against bacteria while the flu is a virus.

Dataset Statistics

| Category | Count |
| --- | --- |
| Total Questions | 599 |
| TEST Split | ~400 |
| DEV Split | ~100 |

By Fact Type

Never-changing: Questions with permanent answers
Slow-changing: Questions reviewed occasionally or yearly
Fast-changing: Questions requiring frequent updates

By Reasoning Complexity

One-hop: Direct factual lookups
Multi-hop: Requires chaining multiple facts together

Source

FreshQA benchmark dataset: FreshLLMs GitHub

For more information about the FreshQA benchmark methodology, see the original research paper.
