A custom NeMo Agent Toolkit (NAT) evaluator implementing the FreshEval Relaxed methodology for evaluating factual accuracy of model responses.
This evaluator implements the FreshEval Relaxed evaluation methodology from FreshLLMs. It evaluates model responses under relaxed criteria where hallucinations, outdated information, and ill-formed answers are allowed, as long as the primary answer is accurate.
```shell
# From the evaluation/freshqa directory
pip install -e .
```

The FreshQA dataset is not included in the repository. Download it before running evaluation:
- Download the dataset from the FreshLLMs GitHub repository.
- Place the dataset files in `frontends/benchmarks/freshqa/data/`:
  - `FreshQA_v112425.json` (required)
  - `FreshQA_v112425.csv` (optional, for reference)
The FreshQA evaluator uses an LLM judge. The default configs use OpenAI GPT-4o as the judge.
- Choose a judge model (e.g. OpenAI GPT-4o or Gemini 2.5 Flash).
- Obtain an API key and set it in `deploy/.env` (e.g., `OPENAI_API_KEY=your_key`).
- To use a different judge, add that LLM under `llms:` and set `eval.evaluators.freshqa.llm_name` to its name.
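The judge wiring described above can be sketched as a config fragment. This is illustrative only: the key names (`judge_llm`) and model choices are placeholders, and the exact schema should be taken from the existing files in `frontends/benchmarks/freshqa/configs/`:

```yaml
# Hypothetical fragment: register an alternative judge LLM and point the
# FreshQA evaluator at it by name.
llms:
  judge_llm:              # any name; referenced below
    _type: openai
    model_name: gpt-4o

eval:
  evaluators:
    freshqa:
      llm_name: judge_llm # must match the key under llms:
```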
Set the following in `deploy/.env`: `NVIDIA_API_KEY` (agent), `TAVILY_API_KEY` (web search).
```shell
# Shallow research only
dotenv -f deploy/.env run nat eval --config_file frontends/benchmarks/freshqa/configs/config_shallow_research_only.yml

# Full workflow (orchestration + research agents)
dotenv -f deploy/.env run nat eval --config_file frontends/benchmarks/freshqa/configs/config_full_workflow.yml
```

Results go to `frontends/benchmarks/freshqa/results` (or the config's `output_dir`).
The FreshEval Relaxed methodology:
- **Relaxed Criteria**: Allows hallucinations, outdated information, and ill-formed answers as long as the primary answer is accurate.
- **Confident Answers Required**: Credits a response only if it provides a confident and definitive answer, or the correct answer can be obviously inferred.
- **False Premise Handling**: For false-premise questions, the response must explicitly point out the presence of a false premise to receive credit.
- **Name Accuracy**: For answers involving names of entities (e.g., people), complete names or commonly recognized names are expected.
- **Numerical Precision**: Approximate numbers are generally not accepted unless explicitly included in the ground-truth answers.
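The rubric above is what the LLM judge is instructed with. A minimal sketch of assembling such a judge prompt follows; the function name and template wording are illustrative, not the evaluator's actual prompt:

```python
def build_relaxed_judge_prompt(question: str, model_response: str,
                               correct_answers: list[str]) -> str:
    """Assemble a FreshEval-style relaxed grading prompt (illustrative)."""
    criteria = (
        "Grade the response under RELAXED criteria:\n"
        "- Hallucinations, outdated information, and ill-formed answers are "
        "allowed as long as the primary answer is accurate.\n"
        "- Credit only confident, definitive answers (or answers from which "
        "the correct answer can be obviously inferred).\n"
        "- For false-premise questions, the response must explicitly point "
        "out the false premise.\n"
        "- Expect complete or commonly recognized entity names.\n"
        "- Do not accept approximate numbers unless they appear in the "
        "ground-truth answers.\n"
    )
    answers = "\n".join(f"- {a}" for a in correct_answers)
    return (
        f"{criteria}\n"
        f"Question: {question}\n"
        f"Acceptable answers:\n{answers}\n"
        f"Response: {model_response}\n"
        'Reply with a rating of "TRUE" or "FALSE" and a short explanation.'
    )
```

The single-string prompt keeps the rubric, question, ground truth, and response together so the judge grades against all five criteria at once.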
The evaluator produces:
- `accuracy`: Mean accuracy across all evaluated items (0-1)
- `total_correct`: Number of correctly answered questions
- `total_evaluated`: Total number of items evaluated
- `average_score`: Accuracy as a percentage (0-100)
Each item includes detailed reasoning with:
- `is_correct`: Boolean indicating if the response was correct
- `rating`: "TRUE" or "FALSE"
- `explanation`: LLM's explanation for the rating
- `question`, `model_response`, `correct_answers`: Context for the evaluation
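The summary metrics follow directly from the per-item ratings. A sketch of the aggregation (function and field names are illustrative; the evaluator computes these internally):

```python
def summarize(items: list[dict]) -> dict:
    """Aggregate per-item judge ratings into FreshQA summary metrics."""
    total = len(items)
    correct = sum(1 for item in items if item["rating"] == "TRUE")
    accuracy = correct / total if total else 0.0
    return {
        "accuracy": accuracy,             # mean accuracy, 0-1
        "total_correct": correct,
        "total_evaluated": total,
        "average_score": accuracy * 100,  # same value as a percentage
    }
```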
The evaluator expects a CSV file with the following columns:
- `question`: The question to be answered
- `answer_0` through `answer_9`: Acceptable correct answers (a question can have multiple)
- `split`: Optional filter column (e.g., "TEST", "DEV")
FreshQA Dataset Intro
The FreshQA benchmark dataset is designed to evaluate how well language models handle questions requiring up-to-date world knowledge.
FreshQA categorizes questions along three key dimensions:
| Dimension | Values | Description |
|---|---|---|
| Fact Type | never-changing, slow-changing, fast-changing | How frequently the answer changes over time |
| Num Hops | one-hop, multi-hop | Whether the question requires single or chained reasoning |
| False Premise | True, False | Whether the question contains an incorrect assumption |
Never-changing: These questions have answers that remain constant over time.
Q: What is the largest mammal in the world? A: Blue whale
Q: Who founded Amazon? A: Jeff Bezos
Q: What is the capital of the commonwealth of Massachusetts? A: Boston
Q: On what date did the Berlin Wall fall? A: November 9, 1989
Q: Who painted The Starry Night? A: Vincent van Gogh
Q: What's the capital of the largest state in America? A: Juneau (Alaska → capital is Juneau)
Q: Which member of The Beatles was born first? A: Ringo Starr
Q: Where was the primary designer of AlexNet born? A: Ukraine (Alex Krizhevsky → born in Ukraine)
Slow-changing: These questions have answers that change infrequently, perhaps yearly or less often.
Q: Who is the CEO of Google? A: Sundar Pichai
Q: How many grand slams does Rafael Nadal have? A: 22
Q: What is the best-selling video game franchise of all time? A: Mario
Q: What is the top-ranked university in the US according to the US News Ranking? A: Princeton University
Q: Who is the current German chancellor? A: Friedrich Merz
Q: Which country has President Joe Biden visited the most during his presidency so far? A: United Kingdom
Q: Who has the most liked post on Instagram? A: Lionel Messi
Fast-changing: These questions have answers that can change frequently, sometimes daily or weekly.
Q: Who is the richest man on earth? A: Elon Musk (as of dataset creation)
Q: Who is the current No. 1 on the Billboard Artist 100? A: Taylor Swift (as of dataset creation)
Q: What is the latest highest-grossing movie of the week at the Box office? A: (Changes weekly)
Q: What is Geoff Hinton's h-index? A: 192 (as of dataset creation)
Q: How old is the richest person in the world? A: 54 years old (Elon Musk's age at dataset creation)
Q: In what country was the most recent Time Person of the Year born? A: USA (Donald Trump → born in USA)
False premise: These questions contain incorrect assumptions that the model must identify and address.
Q: What is the name of Leonardo DiCaprio's third child? A: Leonardo DiCaprio does not have any children.
Q: When did the UK adopt the Euro? A: The United Kingdom has never adopted the Euro.
Q: How many home runs did Real Madrid score? A: Real Madrid is a professional soccer club and thus cannot score home runs. In soccer, players score goals, not home runs.
Q: When was the iPhone 9 released? A: Apple has never released an iPhone 9 nor does it plan to.
Q: When did the first human land on Mars? A: No humans have been to Mars yet.
Q: By how much is 3 bigger than 4? A: 3 is smaller than 4 by 1.
Q: Which antibiotics are most effective against the flu? A: Antibiotics are only effective against bacteria while the flu is a virus.
| Category | Count |
|---|---|
| Total Questions | 599 |
| TEST Split | ~400 |
| DEV Split | ~100 |
- Never-changing: Questions with permanent answers
- Slow-changing: Questions whose answers change occasionally, roughly yearly or less often
- Fast-changing: Questions requiring frequent updates
- One-hop: Direct factual lookups
- Multi-hop: Requires chaining multiple facts together
FreshQA benchmark dataset: FreshLLMs GitHub
For more information about the FreshQA benchmark methodology, see the original research paper.