FreshQA Evaluator

A custom NeMo Agent Toolkit (NAT) evaluator implementing the FreshEval Relaxed methodology for evaluating factual accuracy of model responses.

Overview

FreshEval Relaxed, the methodology introduced by FreshLLMs, judges only the primary answer. A response still earns credit if it contains hallucinations, outdated information, or ill-formed text, as long as its primary answer is accurate.

Installation

# From the frontends/benchmarks/freshqa directory
pip install -e .

Dataset Setup

The FreshQA dataset is not included in the repository. Download it before running evaluation:

  1. Download from the FreshLLMs GitHub repository
  2. Place the dataset files in frontends/benchmarks/freshqa/data/:
    • FreshQA_v112425.json (required)
    • FreshQA_v112425.csv (optional, for reference)

Prerequisites

Judge model and API key

The FreshQA evaluator uses an LLM judge. The default configs use OpenAI GPT-4o as the judge.

  1. Choose a judge model (e.g. OpenAI GPT-4o or Gemini 2.5 Flash).
  2. Obtain an API key and set it in deploy/.env (e.g. OPENAI_API_KEY=your_key).
  3. To use a different judge, add that LLM under llms: and set eval.evaluators.freshqa.llm_name to its name.

Other API keys

Also set NVIDIA_API_KEY (used by the agent) and TAVILY_API_KEY (used for web search) in deploy/.env.
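
A quick pre-flight check can catch a missing key before a long evaluation run. The sketch below is not part of the evaluator; the helper name `missing_keys` is hypothetical, and it assumes the judge is an OpenAI model (otherwise swap `OPENAI_API_KEY` for the key your judge needs).

```python
import os

# Keys named in this README. OPENAI_API_KEY is an assumption that only
# holds when the judge is an OpenAI model.
REQUIRED_KEYS = ("OPENAI_API_KEY", "NVIDIA_API_KEY", "TAVILY_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```

Saved under a name of your choice (for example `check_keys.py`), it can be run through the same wrapper as the evaluation commands, `dotenv -f deploy/.env run python check_keys.py`, so that it sees the values from deploy/.env.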

Quick Start

# Shallow research only
dotenv -f deploy/.env run nat eval --config_file frontends/benchmarks/freshqa/configs/config_shallow_research_only.yml

# Full workflow (orchestration + research agents)
dotenv -f deploy/.env run nat eval --config_file frontends/benchmarks/freshqa/configs/config_full_workflow.yml

Results are written to frontends/benchmarks/freshqa/results, or to the config's output_dir if it points elsewhere.

Evaluation Methodology

The FreshEval Relaxed methodology:

  1. Relaxed Criteria: Allows hallucinations, outdated information, and ill-formed answers as long as the primary answer is accurate.

  2. Confident Answers Required: Credits responses only if they provide a confident and definitive answer, or the correct answer can be obviously inferred.

  3. False Premise Handling: For false-premise questions, the response must explicitly point out the presence of a false premise to receive credit.

  4. Name Accuracy: For answers involving names of entities (e.g., people), complete names or commonly recognized names are expected.

  5. Numerical Precision: Approximate numbers are generally not accepted unless explicitly included in the ground-truth answers.

Output Metrics

The evaluator produces:

  • accuracy: Fraction of evaluated items rated correct (0-1)
  • total_correct: Number of correctly answered questions
  • total_evaluated: Total number of items evaluated
  • average_score: Accuracy as a percentage (0-100)
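
As a rough illustration, the four summary metrics relate to each other as follows. This is a sketch, not the evaluator's actual code: the function name `summarize` is hypothetical, and returning 0.0 for an empty run is an assumption.

```python
def summarize(items):
    """Aggregate per-item results into the four summary metrics.

    `items` is a list of dicts that each carry an `is_correct` boolean.
    """
    total_evaluated = len(items)
    total_correct = sum(1 for item in items if item["is_correct"])
    # Assumption: an empty run reports 0.0 instead of raising.
    accuracy = total_correct / total_evaluated if total_evaluated else 0.0
    return {
        "accuracy": accuracy,
        "total_correct": total_correct,
        "total_evaluated": total_evaluated,
        "average_score": accuracy * 100,
    }


summary = summarize(
    [
        {"is_correct": True},
        {"is_correct": False},
        {"is_correct": True},
        {"is_correct": True},
    ]
)
print(summary)
# {'accuracy': 0.75, 'total_correct': 3, 'total_evaluated': 4, 'average_score': 75.0}
```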

Each item includes detailed reasoning with:

  • is_correct: Boolean indicating if the response was correct
  • rating: "TRUE" or "FALSE"
  • explanation: LLM's explanation for the rating
  • question, model_response, correct_answers: Context for the evaluation
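
The per-item fields can be pictured as a record built from the judge's verdict. The sketch below assumes the rating has already been isolated from the judge's output; how the evaluator extracts it is not described here, and the function name `build_item` is hypothetical.

```python
def build_item(question, model_response, correct_answers, rating, explanation):
    """Build one per-item record from a judge rating of "TRUE" or "FALSE"."""
    normalized = rating.strip().upper()
    if normalized not in ("TRUE", "FALSE"):
        # Assumption: anything other than TRUE/FALSE is treated as an error.
        raise ValueError(f"unexpected rating: {rating!r}")
    return {
        "is_correct": normalized == "TRUE",
        "rating": normalized,
        "explanation": explanation,
        "question": question,
        "model_response": model_response,
        "correct_answers": list(correct_answers),
    }


item = build_item(
    question="Who founded Amazon?",
    model_response="Amazon was founded by Jeff Bezos.",
    correct_answers=["Jeff Bezos"],
    rating="TRUE",
    explanation="The response names Jeff Bezos, which matches the reference.",
)
print(item["is_correct"], item["rating"])
```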

Dataset Format

The evaluator expects a CSV file with the following columns:

  • question: The question to be answered
  • answer_0 through answer_9: Acceptable correct answers (can have multiple)
  • split: Optional filter column (e.g., "TEST", "DEV")
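
To make the column layout concrete, here is a minimal sketch that reads such a CSV with the standard library. It is not the evaluator's loader: the function name `load_rows` and the inline sample are illustrative, and it assumes the column header is the first line of the file.

```python
import csv
import io

ANSWER_COLUMNS = [f"answer_{i}" for i in range(10)]  # answer_0 .. answer_9


def load_rows(csv_file, split=None):
    """Read questions and their non-empty answers, optionally filtered by split."""
    rows = []
    for row in csv.DictReader(csv_file):
        row_split = (row.get("split") or "").strip().upper()
        if split is not None and row_split != split.upper():
            continue
        # Keep only the answer columns that are present and non-empty.
        answers = [
            row[col].strip() for col in ANSWER_COLUMNS if (row.get(col) or "").strip()
        ]
        rows.append({"question": row["question"].strip(), "correct_answers": answers})
    return rows


# Illustrative sample, not taken from the real dataset file.
SAMPLE = """question,answer_0,answer_1,split
Who founded Amazon?,Jeff Bezos,,TEST
Who painted The Starry Night?,Vincent van Gogh,Van Gogh,DEV
"""

rows = load_rows(io.StringIO(SAMPLE), split="TEST")
print(rows)
```

Passing `split=None` keeps every row, which mirrors the fact that the split column is optional.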


FreshQA Dataset Intro

The FreshQA benchmark dataset is designed to evaluate how well language models handle questions requiring up-to-date world knowledge.


Dataset Overview

FreshQA categorizes questions along three key dimensions:

Dimension     | Values                                       | Description
Fact Type     | never-changing, slow-changing, fast-changing | How frequently the answer changes over time
Num Hops      | one-hop, multi-hop                           | Whether the question requires single or chained reasoning
False Premise | True, False                                  | Whether the question contains an incorrect assumption

Never-Changing Facts

These questions have answers that remain constant over time.

One-Hop Examples

Q: What is the largest mammal in the world? A: Blue whale

Q: Who founded Amazon? A: Jeff Bezos

Q: What is the capital of the commonwealth of Massachusetts? A: Boston

Q: On what date did the Berlin Wall fall? A: November 9, 1989

Q: Who painted The Starry Night? A: Vincent van Gogh

Multi-Hop Examples

Q: What's the capital of the largest state in America? A: Juneau (Alaska → capital is Juneau)

Q: Which member of The Beatles was born first? A: Ringo Starr

Q: Where was the primary designer of AlexNet born? A: Ukraine (Alex Krizhevsky → born in Ukraine)


Slow-Changing Facts

These questions have answers that change infrequently, perhaps yearly or less often.

One-Hop Examples

Q: Who is the CEO of Google? A: Sundar Pichai

Q: How many grand slams does Rafael Nadal have? A: 22

Q: What is the best-selling video game franchise of all time? A: Mario

Q: What is the top-ranked university in the US according to the US News Ranking? A: Princeton University

Q: Who is the current German chancellor? A: Friedrich Merz

Multi-Hop Examples

Q: Which country has President Joe Biden visited the most during his presidency so far? A: United Kingdom

Q: Who has the most liked post on Instagram? A: Lionel Messi


Fast-Changing Facts

These questions have answers that can change frequently, sometimes daily or weekly.

One-Hop Examples

Q: Who is the richest man on earth? A: Elon Musk (as of dataset creation)

Q: Who is the current No. 1 on the Billboard Artist 100? A: Taylor Swift (as of dataset creation)

Q: What is the latest highest-grossing movie of the week at the Box office? A: (Changes weekly)

Q: What is Geoff Hinton's h-index? A: 192 (as of dataset creation)

Multi-Hop Examples

Q: How old is the richest person in the world? A: 54 years old (Elon Musk's age at dataset creation)

Q: In what country was the most recent Time Person of the Year born? A: USA (Donald Trump → born in USA)


False Premise Questions

These questions contain incorrect assumptions that the model must identify and address.

Factual Corrections

Q: What is the name of Leonardo DiCaprio's third child? A: Leonardo DiCaprio does not have any children.

Q: When did the UK adopt the Euro? A: The United Kingdom has never adopted the Euro.

Q: How many home runs did Real Madrid score? A: Real Madrid is a professional soccer club and thus cannot score home runs. In soccer, players score goals, not home runs.

Temporal Corrections

Q: When was the iPhone 9 released? A: Apple has never released an iPhone 9 nor does it plan to.

Q: When did the first human land on Mars? A: No humans have been to Mars yet.

Logical Corrections

Q: By how much is 3 bigger than 4? A: 3 is smaller than 4 by 1.

Q: Which antibiotics are most effective against the flu? A: Antibiotics are only effective against bacteria while the flu is a virus.


Dataset Statistics

Category        | Count
Total Questions | 599
TEST Split      | ~400
DEV Split       | ~100

By Fact Type

  • Never-changing: Questions with permanent answers
  • Slow-changing: Questions reviewed occasionally or yearly
  • Fast-changing: Questions requiring frequent updates

By Reasoning Complexity

  • One-hop: Direct factual lookups
  • Multi-hop: Requires chaining multiple facts together

Source

FreshQA benchmark dataset: FreshLLMs GitHub

For more information about the FreshQA benchmark methodology, see the original research paper.