Skip to content

Simplify: evals/ module#42

Merged
William-Hill merged 6 commits into
mainfrom
simplify/evals
Mar 5, 2026
Merged

Simplify: evals/ module#42
William-Hill merged 6 commits into
mainfrom
simplify/evals

Conversation

@William-Hill

@William-Hill William-Hill commented Mar 5, 2026

Copy link
Copy Markdown
Owner

Summary

Closes #26

  • Code reuse (1.1–1.3): Remove dead fallback keys from result extraction; pass actual source_urls, report, and extracted_contents to evaluations — batch evals now have parity with the live research path
  • Code quality (2.1–2.4): Remove 3 unused parameters (eval_types, interactive, output_csv_path) from runner + CLI; fix mixed imports; use modern type hints
  • Efficiency (3.1–3.4): Filter NULL results in SQL; add ORDER BY created_at DESC for deterministic LIMIT; keep session open during asyncio.gather to prevent detached ORM objects; add full traceback logging for failures

Complexity Metrics

Metric Before After Delta
Source LOC 100 113 +13
Source files 2 2 0
Functions/methods 1 2 +1
Avg cyclomatic complexity 7.0 (B) 9.5 (B) +2.5
CC grade distribution B:1 B:1, C:1
Avg maintainability index 85.0 64.75 -20.25

LOC and complexity increased because the module now does meaningful work it was previously skipping (passing real sources, report, and extracted contents to evaluators).

Test plan

  • Verify python -c "from d4bl.evals.runner import run_evals_and_log" imports cleanly
  • Verify python scripts/run_evals.py --help shows updated CLI (no --eval-types, --interactive, --output-csv)
  • Run python scripts/run_evals.py --max-rows 1 against a DB with completed jobs to confirm evaluations execute with real sources/report

Summary by CodeRabbit

  • Bug Fixes

    • Fixed potential division-by-zero in average complexity output.
    • Added warnings and improved error reporting for failed or empty evaluation jobs.
  • Documentation

    • Improved report table layout and clarified "Key changes" formatting.
    • Reworded CLI help/examples to reference "completed research jobs" and clarified max-rows description.
  • New Features

    • Added explicit concurrency control for running evaluations.
  • Chores

    • Removed deprecated CLI options and streamlined evaluation run options.

Closes #25 — addresses 11 findings across code reuse, quality, and
efficiency in src/d4bl/observability/.
Tracks LOC, cyclomatic complexity, and maintainability index
before/after each module simplification PR. Includes per-module
baselines for remaining work and "how to regenerate" instructions.
- Remove dead fallback keys (answer, text, output) from result extraction
- Pass actual source_urls, report, and extracted_contents to evaluations
- Remove unused parameters (eval_types, interactive, output_csv_path)
- Fix mixed imports, remove unused Path import, use modern type hints
- Filter NULL results in SQL, add ORDER BY for deterministic LIMIT
- Keep session open during asyncio.gather to prevent detached ORM objects
- Add full traceback logging for evaluation failures
@coderabbitai

coderabbitai Bot commented Mar 5, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

Refactors batch evaluation flow: removes three CLI options, centralizes job result extraction via _extract_eval_inputs(), updates run_evals_and_log() signature to add concurrency and tighten types, queries completed jobs with SQL-level filters and ordering, and runs evaluations concurrently with improved error and trace logging. Docs formatting and a division-by-zero guard were also adjusted.

Changes

Cohort / File(s) Summary
Documentation & Metrics
docs/SIMPLIFICATION_REPORT.md
Reformatted the metrics table (delta values moved to a Delta column), converted per-module "Key changes" rows into bold-prefixed inline text, and added a division-by-zero guard when printing average complexity.
CLI Simplification
scripts/run_evals.py
Reworded CLI description/examples to reference "jobs"; removed --eval-types, --interactive, and --output-csv options and related parsing; updated main() to pass only max_rows, concurrency, and job_ids to the runner.
Eval Runner Refactor
src/d4bl/evals/runner.py
Added `_extract_eval_inputs(job: ResearchJob) -> dict

Sequence Diagram(s)

sequenceDiagram
    participant Main as Main/CLI
    participant Runner as run_evals_and_log()
    participant DB as Database
    participant Jobs as ResearchJob list
    participant Extract as _extract_eval_inputs()
    participant ThreadPool as ThreadPoolExecutor
    participant Eval as run_comprehensive_evaluation()

    Main->>Runner: call(max_rows, concurrency, selected_job_ids)
    Runner->>DB: query completed jobs where result IS NOT NULL,<br/>filter by selected_job_ids, ORDER BY created_at DESC, LIMIT max_rows
    DB-->>Jobs: return ResearchJob instances
    Runner->>Jobs: iterate and spawn tasks (asyncio.gather)
    loop for each job (concurrent)
        Jobs->>Extract: _extract_eval_inputs(job)
        Extract-->>Runner: inputs dict or None
        alt inputs present
            Runner->>ThreadPool: submit run_comprehensive_evaluation(inputs)
            ThreadPool->>Eval: execute evaluation
            Eval-->>ThreadPool: evaluation result
            ThreadPool-->>Runner: result (includes trace_id)
            Runner->>Main: log/store evaluation result
        else inputs missing
            Runner->>Main: log warning (no usable result)
        end
    end
    Runner-->>Main: complete (errors logged with tracebacks if any)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main objective: simplifying the evals module by removing unused parameters, fixing code reuse issues, and improving efficiency.
Linked Issues check ✅ Passed The PR successfully addresses all primary and secondary objectives from issue #26: removes dead fallback keys, passes real source_urls/report/extracted_contents, removes unused parameters, normalizes imports, updates type hints, filters NULL results in SQL, adds ORDER BY determinism, keeps DB session open during async operations, and preserves full tracebacks.
Out of Scope Changes check ✅ Passed All changes are directly aligned with the objectives outlined in issue #26; the documentation update to SIMPLIFICATION_REPORT.md is supporting context for the code changes and not out of scope.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch simplify/evals

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/SIMPLIFICATION_REPORT.md`:
- Around line 164-166: The script at the end uses `count` and `total_cc` to
print an average but can divide by zero if `count` is 0; update the final
printing logic to check `count == 0` before computing `total_cc / count` (the
`count` and `total_cc` variables and the two print statements are the relevant
symbols), and when `count` is zero print a sensible message or show 0.0 for the
average instead of performing the division to avoid a ZeroDivisionError.
- Around line 31-35: The tables under the "Completed Modules" section use a
4-column header ("| Metric | Before | After | Delta |") but data rows lack
values for Before/After/Delta; update each affected table row to either supply
concrete Before, After, and Delta values for every Metric row (e.g., fill
numeric LOC or % deltas) or simplify the table to a two-column format ("Metric |
Notes") and adjust the header and all rows accordingly; ensure every table under
"Completed Modules" uses the same consistent column count as the header
(reference the header string "| Metric | Before | After | Delta |" and the
metric rows currently under that section) so Markdown renders correctly.

In `@src/d4bl/evals/runner.py`:
- Line 24: The assignment to research_output in runner.py is over the 100-char
limit; split the expression across multiple lines to keep each line ≤100 chars
by extracting the conditional check into a short temporary or by using a
two-step assignment: first compute a safe_raw = raw_result.get("raw_output", "")
if isinstance(raw_result, dict) else "" and then set research_output =
str(safe_raw).strip(); reference the research_output variable and raw_result
check in your edit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 08104100-3ed9-4baf-8fdd-76b4fbc5cdeb

📥 Commits

Reviewing files that changed from the base of the PR and between 1a1da79 and 8f9e4f2.

📒 Files selected for processing (4)
  • docs/SIMPLIFICATION_REPORT.md
  • docs/plans/2026-03-04-simplify-observability.md
  • scripts/run_evals.py
  • src/d4bl/evals/runner.py

Comment thread docs/SIMPLIFICATION_REPORT.md
Comment thread docs/SIMPLIFICATION_REPORT.md Outdated
Comment thread src/d4bl/evals/runner.py Outdated
- Resolve merge conflict in SIMPLIFICATION_REPORT.md (take main's observability metrics)
- Fix table column mismatch: move "Key changes" rows out of tables into text
- Add zero-check for division in example script
- Break long line in runner.py to stay within 100-char limit

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/d4bl/evals/runner.py`:
- Around line 32-45: Normalize and validate research_data and findings before
building evaluator kwargs: ensure job.research_data is a dict (otherwise set
research_data = {}) and that research_findings is a list (otherwise set findings
= []), then use these sanitized variables when populating "sources" and
"extracted_contents"; also when constructing extracted_contents ensure each item
f is a dict and safely read f.get("url","") and f.get("content","") and skip
non-dict entries so the list comprehension in extracted_contents cannot raise
AttributeError/TypeError (refer to symbols research_data, findings,
job.research_data, research_findings, and extracted_contents).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a3a219ed-3511-4c5a-86f3-467f3f52ec6e

📥 Commits

Reviewing files that changed from the base of the PR and between 8f9e4f2 and 1a2c41d.

📒 Files selected for processing (2)
  • docs/SIMPLIFICATION_REPORT.md
  • src/d4bl/evals/runner.py

Comment thread src/d4bl/evals/runner.py Outdated
Comment on lines +32 to +45
research_data = job.research_data or {}
findings = research_data.get("research_findings") or []

return {
"query": job.query,
"research_output": research_output,
"sources": research_data.get("source_urls", []),
"trace_id": job.trace_id or str(job.job_id),
"report": raw_result.get("report") if isinstance(raw_result, dict) else None,
"extracted_contents": [
{"url": f.get("url", ""), "content": f.get("content", "")}
for f in findings
if isinstance(f, dict) and "url" in f
] or None,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Harden JSON shape handling before building evaluator kwargs.

Line 32 and Line 41 assume research_data/research_findings have dict/list shapes. If a row contains nonconforming JSON, this can raise AttributeError/TypeError and fail that job evaluation path.

🛠️ Proposed fix
-    research_data = job.research_data or {}
-    findings = research_data.get("research_findings") or []
+    research_data = job.research_data if isinstance(job.research_data, dict) else {}
+    source_urls_raw = research_data.get("source_urls")
+    source_urls = [u for u in source_urls_raw if isinstance(u, str)] if isinstance(
+        source_urls_raw, list
+    ) else []
+    findings_raw = research_data.get("research_findings")
+    findings = findings_raw if isinstance(findings_raw, list) else []
+    report_raw = raw_result.get("report") if isinstance(raw_result, dict) else None
+    report = report_raw if isinstance(report_raw, str) else None

     return {
         "query": job.query,
         "research_output": research_output,
-        "sources": research_data.get("source_urls", []),
+        "sources": source_urls,
         "trace_id": job.trace_id or str(job.job_id),
-        "report": raw_result.get("report") if isinstance(raw_result, dict) else None,
+        "report": report,
         "extracted_contents": [
-            {"url": f.get("url", ""), "content": f.get("content", "")}
+            {"url": str(f.get("url", "")), "content": str(f.get("content", ""))}
             for f in findings
-            if isinstance(f, dict) and "url" in f
+            if isinstance(f, dict) and f.get("url")
         ] or None,
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/d4bl/evals/runner.py` around lines 32 - 45, Normalize and validate
research_data and findings before building evaluator kwargs: ensure
job.research_data is a dict (otherwise set research_data = {}) and that
research_findings is a list (otherwise set findings = []), then use these
sanitized variables when populating "sources" and "extracted_contents"; also
when constructing extracted_contents ensure each item f is a dict and safely
read f.get("url","") and f.get("content","") and skip non-dict entries so the
list comprehension in extracted_contents cannot raise AttributeError/TypeError
(refer to symbols research_data, findings, job.research_data, research_findings,
and extracted_contents).

Validate that research_data, source_urls, research_findings, and report
have expected types before accessing dict/list methods, preventing
AttributeError on nonconforming JSON rows.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/d4bl/evals/runner.py`:
- Line 73: Validate the concurrency parameter before creating the semaphore:
check the variable concurrency and if it is not a positive integer (<= 0) raise
a clear ValueError (or coerce to a sensible minimum like 1) before executing sem
= asyncio.Semaphore(concurrency); this guard should be placed immediately before
the Semaphore creation so the code paths using sem (referencing sem and
asyncio.Semaphore) never get a zero-capacity semaphore.
- Line 11: The file binds async_session_maker at import time which becomes stale
after init_db() reassigns it; change the import to use the database module
(e.g., import d4bl.infra.database as database) and update any call sites in this
file that call async_session_maker() to use database.async_session_maker() (and
optionally reference ResearchJob via database.ResearchJob if you want
consistency) so the runtime uses the updated binding set by init_db().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5b747ecc-f728-4e85-8ea2-03c5bc10812f

📥 Commits

Reviewing files that changed from the base of the PR and between 1a2c41d and 5ec8f20.

📒 Files selected for processing (1)
  • src/d4bl/evals/runner.py

Comment thread src/d4bl/evals/runner.py Outdated
Comment thread src/d4bl/evals/runner.py
…rency

- Access async_session_maker via db_module to avoid stale None binding
  from import-time snapshot (async_session_maker is assigned in init_db)
- Reject concurrency < 1 to prevent Semaphore(0) deadlock

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (3)
src/d4bl/evals/runner.py (3)

52-56: ⚠️ Potential issue | 🟡 Minor

Type safety: url and content values may not be strings.

The f.get("url", "") and f.get("content", "") calls return whatever type is stored in the dict. Per the expected signature, extracted_contents should contain string values.

🛠️ Proposed fix
         "extracted_contents": [
-            {"url": f.get("url", ""), "content": f.get("content", "")}
+            {"url": str(f.get("url", "")), "content": str(f.get("content", ""))}
             for f in findings
             if isinstance(f, dict) and "url" in f
         ] or None,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/d4bl/evals/runner.py` around lines 52 - 56, The list comprehension that
builds "extracted_contents" uses f.get("url", "") and f.get("content", "") which
may return non-string types; update the comprehension to coerce both values to
strings (and normalize None to empty string) before adding them so the resulting
entries always match the expected string signature. Locate the comprehension
that produces "extracted_contents" (using variables findings and
f.get("url"/"content")) and replace the direct f.get calls with a small in-line
normalization such as url = f.get("url", "") and content = f.get("content", ""),
then set url = "" if url is None else str(url) and content = "" if content is
None else str(content) (or equivalent) so every {"url": ..., "content": ...}
entry contains strings.

92-92: ⚠️ Potential issue | 🟡 Minor

Guard against uninitialized async_session_maker.

If init_db() fails silently or is called in a context where initialization doesn't complete, db_module.async_session_maker remains None, causing a TypeError when called.

🛠️ Proposed fix
     init_db()
 
+    if db_module.async_session_maker is None:
+        raise RuntimeError("Database session maker not initialized")
     if concurrency < 1:
         raise ValueError("concurrency must be >= 1")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/d4bl/evals/runner.py` at line 92, The code uses
db_module.async_session_maker without ensuring it was initialized; guard the
usage in the function containing "async with db_module.async_session_maker() as
db:" by checking that db_module.async_session_maker is not None and is callable
(or initialize it) before calling; if it's missing, raise a clear RuntimeError
(or call await db_module.init_db() if initialization is safe here) with a
message referencing init_db and async_session_maker so callers know to
initialize the DB first.

38-39: ⚠️ Potential issue | 🟡 Minor

Type safety: sources list may contain non-string elements.

run_comprehensive_evaluation expects sources: List[str], but source_urls from the DB may contain non-string items. Consider filtering or coercing to strings.

🛠️ Proposed fix
     sources_raw = research_data.get("source_urls")
-    sources = sources_raw if isinstance(sources_raw, list) else []
+    sources = [
+        str(s) for s in sources_raw if s is not None
+    ] if isinstance(sources_raw, list) else []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/d4bl/evals/runner.py` around lines 38 - 39, The sources list may include
non-string items from research_data.get("source_urls"); update the extraction
around sources_raw/sources in run_comprehensive_evaluation to produce a
List[str] by filtering out non-string entries and coercing allowed types (e.g.,
numbers) to strings: read research_data.get("source_urls") into sources_raw,
then build sources = [str(s) for s in sources_raw if s is not None and
(isinstance(s, str) or is_scalar_coercible(s))] or similar, ensuring only valid
string items are passed to run_comprehensive_evaluation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/d4bl/evals/runner.py`:
- Around line 52-56: The list comprehension that builds "extracted_contents"
uses f.get("url", "") and f.get("content", "") which may return non-string
types; update the comprehension to coerce both values to strings (and normalize
None to empty string) before adding them so the resulting entries always match
the expected string signature. Locate the comprehension that produces
"extracted_contents" (using variables findings and f.get("url"/"content")) and
replace the direct f.get calls with a small in-line normalization such as url =
f.get("url", "") and content = f.get("content", ""), then set url = "" if url is
None else str(url) and content = "" if content is None else str(content) (or
equivalent) so every {"url": ..., "content": ...} entry contains strings.
- Line 92: The code uses db_module.async_session_maker without ensuring it was
initialized; guard the usage in the function containing "async with
db_module.async_session_maker() as db:" by checking that
db_module.async_session_maker is not None and is callable (or initialize it)
before calling; if it's missing, raise a clear RuntimeError (or call await
db_module.init_db() if initialization is safe here) with a message referencing
init_db and async_session_maker so callers know to initialize the DB first.
- Around line 38-39: The sources list may include non-string items from
research_data.get("source_urls"); update the extraction around
sources_raw/sources in run_comprehensive_evaluation to produce a List[str] by
filtering out non-string entries and coercing allowed types (e.g., numbers) to
strings: read research_data.get("source_urls") into sources_raw, then build
sources = [str(s) for s in sources_raw if s is not None and (isinstance(s, str)
or is_scalar_coercible(s))] or similar, ensuring only valid string items are
passed to run_comprehensive_evaluation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7815f5c7-fc24-4cc1-a86b-9d5947d0d816

📥 Commits

Reviewing files that changed from the base of the PR and between 5ec8f20 and d2d9ea2.

📒 Files selected for processing (1)
  • src/d4bl/evals/runner.py

@William-Hill William-Hill merged commit 59fc487 into main Mar 5, 2026
@William-Hill William-Hill deleted the simplify/evals branch March 5, 2026 02:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Simplify: evals/ module

1 participant