Simplify: evals/ module by William-Hill · Pull Request #42 · William-Hill/d4bl_ai_agent

William-Hill · 2026-03-05T00:33:04Z

Summary

Closes #26

Code reuse (1.1–1.3): Remove dead fallback keys from result extraction; pass actual source_urls, report, and extracted_contents to evaluations — batch evals now have parity with the live research path
Code quality (2.1–2.4): Remove 3 unused parameters (eval_types, interactive, output_csv_path) from runner + CLI; fix mixed imports; use modern type hints
Efficiency (3.1–3.4): Filter NULL results in SQL; add ORDER BY created_at DESC for deterministic LIMIT; keep session open during asyncio.gather to prevent detached ORM objects; add full traceback logging for failures

Complexity Metrics

Metric	Before	After	Delta
Source LOC	100	113	+13
Source files	2	2	0
Functions/methods	1	2	+1
Avg cyclomatic complexity	7.0 (B)	9.5 (B)	+2.5
CC grade distribution	B:1	B:1, C:1
Avg maintainability index	85.0	64.75	-20.25

LOC and complexity increased because the module now does meaningful work it was previously skipping (passing real sources, report, and extracted contents to evaluators).

Test plan

Verify python -c "from d4bl.evals.runner import run_evals_and_log" imports cleanly
Verify python scripts/run_evals.py --help shows updated CLI (no --eval-types, --interactive, --output-csv)
Run python scripts/run_evals.py --max-rows 1 against a DB with completed jobs to confirm evaluations execute with real sources/report

Summary by CodeRabbit

Bug Fixes
- Fixed potential division-by-zero in average complexity output.
- Added warnings and improved error reporting for failed or empty evaluation jobs.
Documentation
- Improved report table layout and clarified "Key changes" formatting.
- Reworded CLI help/examples to reference "completed research jobs" and clarified max-rows description.
New Features
- Added explicit concurrency control for running evaluations.
Chores
- Removed deprecated CLI options and streamlined evaluation run options.

Closes #25 — addresses 11 findings across code reuse, quality, and efficiency in src/d4bl/observability/.

Tracks LOC, cyclomatic complexity, and maintainability index before/after each module simplification PR. Includes per-module baselines for remaining work and "how to regenerate" instructions.

- Remove dead fallback keys (answer, text, output) from result extraction - Pass actual source_urls, report, and extracted_contents to evaluations - Remove unused parameters (eval_types, interactive, output_csv_path) - Fix mixed imports, remove unused Path import, use modern type hints - Filter NULL results in SQL, add ORDER BY for deterministic LIMIT - Keep session open during asyncio.gather to prevent detached ORM objects - Add full traceback logging for evaluation failures

coderabbitai · 2026-03-05T00:33:24Z

📝 Walkthrough

Walkthrough

Refactors batch evaluation flow: removes three CLI options, centralizes job result extraction via _extract_eval_inputs(), updates run_evals_and_log() signature to add concurrency and tighten types, queries completed jobs with SQL-level filters and ordering, and runs evaluations concurrently with improved error and trace logging. Docs formatting and a division-by-zero guard were also adjusted.

Changes

Cohort / File(s)	Summary
Documentation & Metrics `docs/SIMPLIFICATION_REPORT.md`	Reformatted the metrics table (delta values moved to a Delta column), converted per-module "Key changes" rows into bold-prefixed inline text, and added a division-by-zero guard when printing average complexity.
CLI Simplification `scripts/run_evals.py`	Reworded CLI description/examples to reference "jobs"; removed `--eval-types`, `--interactive`, and `--output-csv` options and related parsing; updated `main()` to pass only `max_rows`, `concurrency`, and `job_ids` to the runner.
Eval Runner Refactor `src/d4bl/evals/runner.py`	Added `_extract_eval_inputs(job: ResearchJob) -> dict

Sequence Diagram(s)

sequenceDiagram
    participant Main as Main/CLI
    participant Runner as run_evals_and_log()
    participant DB as Database
    participant Jobs as ResearchJob list
    participant Extract as _extract_eval_inputs()
    participant ThreadPool as ThreadPoolExecutor
    participant Eval as run_comprehensive_evaluation()

    Main->>Runner: call(max_rows, concurrency, selected_job_ids)
    Runner->>DB: query completed jobs where result IS NOT NULL,<br/>filter by selected_job_ids, ORDER BY created_at DESC, LIMIT max_rows
    DB-->>Jobs: return ResearchJob instances
    Runner->>Jobs: iterate and spawn tasks (asyncio.gather)
    loop for each job (concurrent)
        Jobs->>Extract: _extract_eval_inputs(job)
        Extract-->>Runner: inputs dict or None
        alt inputs present
            Runner->>ThreadPool: submit run_comprehensive_evaluation(inputs)
            ThreadPool->>Eval: execute evaluation
            Eval-->>ThreadPool: evaluation result
            ThreadPool-->>Runner: result (includes trace_id)
            Runner->>Main: log/store evaluation result
        else inputs missing
            Runner->>Main: log warning (no usable result)
        end
    end
    Runner-->>Main: complete (errors logged with tracebacks if any)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

refactor: Simplify services/ module — base evaluator, dead code cleanup #30 — Refactor of evaluation runner and run_comprehensive_evaluation logic; strongly related to changes in src/d4bl/evals/runner.py.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main objective: simplifying the evals module by removing unused parameters, fixing code reuse issues, and improving efficiency.
Linked Issues check	✅ Passed	The PR successfully addresses all primary and secondary objectives from issue `#26`: removes dead fallback keys, passes real source_urls/report/extracted_contents, removes unused parameters, normalizes imports, updates type hints, filters NULL results in SQL, adds ORDER BY determinism, keeps DB session open during async operations, and preserves full tracebacks.
Out of Scope Changes check	✅ Passed	All changes are directly aligned with the objectives outlined in issue `#26`; the documentation update to SIMPLIFICATION_REPORT.md is supporting context for the code changes and not out of scope.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch simplify/evals

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/SIMPLIFICATION_REPORT.md`:
- Around line 164-166: The script at the end uses `count` and `total_cc` to
print an average but can divide by zero if `count` is 0; update the final
printing logic to check `count == 0` before computing `total_cc / count` (the
`count` and `total_cc` variables and the two print statements are the relevant
symbols), and when `count` is zero print a sensible message or show 0.0 for the
average instead of performing the division to avoid a ZeroDivisionError.
- Around line 31-35: The tables under the "Completed Modules" section use a
4-column header ("| Metric | Before | After | Delta |") but data rows lack
values for Before/After/Delta; update each affected table row to either supply
concrete Before, After, and Delta values for every Metric row (e.g., fill
numeric LOC or % deltas) or simplify the table to a two-column format ("Metric |
Notes") and adjust the header and all rows accordingly; ensure every table under
"Completed Modules" uses the same consistent column count as the header
(reference the header string "| Metric | Before | After | Delta |" and the
metric rows currently under that section) so Markdown renders correctly.

In `@src/d4bl/evals/runner.py`:
- Line 24: The assignment to research_output in runner.py is over the 100-char
limit; split the expression across multiple lines to keep each line ≤100 chars
by extracting the conditional check into a short temporary or by using a
two-step assignment: first compute a safe_raw = raw_result.get("raw_output", "")
if isinstance(raw_result, dict) else "" and then set research_output =
str(safe_raw).strip(); reference the research_output variable and raw_result
check in your edit.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 08104100-3ed9-4baf-8fdd-76b4fbc5cdeb

📥 Commits

Reviewing files that changed from the base of the PR and between 1a1da79 and 8f9e4f2.

📒 Files selected for processing (4)

docs/SIMPLIFICATION_REPORT.md
docs/plans/2026-03-04-simplify-observability.md
scripts/run_evals.py
src/d4bl/evals/runner.py

- Resolve merge conflict in SIMPLIFICATION_REPORT.md (take main's observability metrics) - Fix table column mismatch: move "Key changes" rows out of tables into text - Add zero-check for division in example script - Break long line in runner.py to stay within 100-char limit

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/d4bl/evals/runner.py`:
- Around line 32-45: Normalize and validate research_data and findings before
building evaluator kwargs: ensure job.research_data is a dict (otherwise set
research_data = {}) and that research_findings is a list (otherwise set findings
= []), then use these sanitized variables when populating "sources" and
"extracted_contents"; also when constructing extracted_contents ensure each item
f is a dict and safely read f.get("url","") and f.get("content","") and skip
non-dict entries so the list comprehension in extracted_contents cannot raise
AttributeError/TypeError (refer to symbols research_data, findings,
job.research_data, research_findings, and extracted_contents).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a3a219ed-3511-4c5a-86f3-467f3f52ec6e

📥 Commits

Reviewing files that changed from the base of the PR and between 8f9e4f2 and 1a2c41d.

📒 Files selected for processing (2)

docs/SIMPLIFICATION_REPORT.md
src/d4bl/evals/runner.py

coderabbitai · 2026-03-05T01:01:23Z

+    research_data = job.research_data or {}
+    findings = research_data.get("research_findings") or []
+
+    return {
+        "query": job.query,
+        "research_output": research_output,
+        "sources": research_data.get("source_urls", []),
+        "trace_id": job.trace_id or str(job.job_id),
+        "report": raw_result.get("report") if isinstance(raw_result, dict) else None,
+        "extracted_contents": [
+            {"url": f.get("url", ""), "content": f.get("content", "")}
+            for f in findings
+            if isinstance(f, dict) and "url" in f
+        ] or None,


⚠️ Potential issue | 🟠 Major

Harden JSON shape handling before building evaluator kwargs.

Line 32 and Line 41 assume research_data/research_findings have dict/list shapes. If a row contains nonconforming JSON, this can raise AttributeError/TypeError and fail that job evaluation path.

🛠️ Proposed fix

- research_data = job.research_data or {} - findings = research_data.get("research_findings") or [] + research_data = job.research_data if isinstance(job.research_data, dict) else {} + source_urls_raw = research_data.get("source_urls") + source_urls = [u for u in source_urls_raw if isinstance(u, str)] if isinstance( + source_urls_raw, list + ) else [] + findings_raw = research_data.get("research_findings") + findings = findings_raw if isinstance(findings_raw, list) else [] + report_raw = raw_result.get("report") if isinstance(raw_result, dict) else None + report = report_raw if isinstance(report_raw, str) else None return { "query": job.query, "research_output": research_output, - "sources": research_data.get("source_urls", []), + "sources": source_urls, "trace_id": job.trace_id or str(job.job_id), - "report": raw_result.get("report") if isinstance(raw_result, dict) else None, + "report": report, "extracted_contents": [ - {"url": f.get("url", ""), "content": f.get("content", "")} + {"url": str(f.get("url", "")), "content": str(f.get("content", ""))} for f in findings - if isinstance(f, dict) and "url" in f + if isinstance(f, dict) and f.get("url") ] or None, }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/d4bl/evals/runner.py` around lines 32 - 45, Normalize and validate research_data and findings before building evaluator kwargs: ensure job.research_data is a dict (otherwise set research_data = {}) and that research_findings is a list (otherwise set findings = []), then use these sanitized variables when populating "sources" and "extracted_contents"; also when constructing extracted_contents ensure each item f is a dict and safely read f.get("url","") and f.get("content","") and skip non-dict entries so the list comprehension in extracted_contents cannot raise AttributeError/TypeError (refer to symbols research_data, findings, job.research_data, research_findings, and extracted_contents).

Validate that research_data, source_urls, research_findings, and report have expected types before accessing dict/list methods, preventing AttributeError on nonconforming JSON rows.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/d4bl/evals/runner.py`:
- Line 73: Validate the concurrency parameter before creating the semaphore:
check the variable concurrency and if it is not a positive integer (<= 0) raise
a clear ValueError (or coerce to a sensible minimum like 1) before executing sem
= asyncio.Semaphore(concurrency); this guard should be placed immediately before
the Semaphore creation so the code paths using sem (referencing sem and
asyncio.Semaphore) never get a zero-capacity semaphore.
- Line 11: The file binds async_session_maker at import time which becomes stale
after init_db() reassigns it; change the import to use the database module
(e.g., import d4bl.infra.database as database) and update any call sites in this
file that call async_session_maker() to use database.async_session_maker() (and
optionally reference ResearchJob via database.ResearchJob if you want
consistency) so the runtime uses the updated binding set by init_db().

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5b747ecc-f728-4e85-8ea2-03c5bc10812f

📥 Commits

Reviewing files that changed from the base of the PR and between 1a2c41d and 5ec8f20.

📒 Files selected for processing (1)

src/d4bl/evals/runner.py

…rency - Access async_session_maker via db_module to avoid stale None binding from import-time snapshot (async_session_maker is assigned in init_db) - Reject concurrency < 1 to prevent Semaphore(0) deadlock

coderabbitai

♻️ Duplicate comments (3)

src/d4bl/evals/runner.py (3)

52-56: ⚠️ Potential issue | 🟡 Minor

Type safety: url and content values may not be strings.

The f.get("url", "") and f.get("content", "") calls return whatever type is stored in the dict. Per the expected signature, extracted_contents should contain string values.

🛠️ Proposed fix

         "extracted_contents": [
-            {"url": f.get("url", ""), "content": f.get("content", "")}
+            {"url": str(f.get("url", "")), "content": str(f.get("content", ""))}
             for f in findings
             if isinstance(f, dict) and "url" in f
         ] or None,

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/d4bl/evals/runner.py` around lines 52 - 56, The list comprehension that
builds "extracted_contents" uses f.get("url", "") and f.get("content", "") which
may return non-string types; update the comprehension to coerce both values to
strings (and normalize None to empty string) before adding them so the resulting
entries always match the expected string signature. Locate the comprehension
that produces "extracted_contents" (using variables findings and
f.get("url"/"content")) and replace the direct f.get calls with a small in-line
normalization such as url = f.get("url", "") and content = f.get("content", ""),
then set url = "" if url is None else str(url) and content = "" if content is
None else str(content) (or equivalent) so every {"url": ..., "content": ...}
entry contains strings.

92-92: ⚠️ Potential issue | 🟡 Minor

Guard against uninitialized async_session_maker.

If init_db() fails silently or is called in a context where initialization doesn't complete, db_module.async_session_maker remains None, causing a TypeError when called.

🛠️ Proposed fix

     init_db()
 
+    if db_module.async_session_maker is None:
+        raise RuntimeError("Database session maker not initialized")
     if concurrency < 1:
         raise ValueError("concurrency must be >= 1")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/d4bl/evals/runner.py` at line 92, The code uses
db_module.async_session_maker without ensuring it was initialized; guard the
usage in the function containing "async with db_module.async_session_maker() as
db:" by checking that db_module.async_session_maker is not None and is callable
(or initialize it) before calling; if it's missing, raise a clear RuntimeError
(or call await db_module.init_db() if initialization is safe here) with a
message referencing init_db and async_session_maker so callers know to
initialize the DB first.

38-39: ⚠️ Potential issue | 🟡 Minor

Type safety: sources list may contain non-string elements.

run_comprehensive_evaluation expects sources: List[str], but source_urls from the DB may contain non-string items. Consider filtering or coercing to strings.

🛠️ Proposed fix

     sources_raw = research_data.get("source_urls")
-    sources = sources_raw if isinstance(sources_raw, list) else []
+    sources = [
+        str(s) for s in sources_raw if s is not None
+    ] if isinstance(sources_raw, list) else []

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/d4bl/evals/runner.py` around lines 38 - 39, The sources list may include
non-string items from research_data.get("source_urls"); update the extraction
around sources_raw/sources in run_comprehensive_evaluation to produce a
List[str] by filtering out non-string entries and coercing allowed types (e.g.,
numbers) to strings: read research_data.get("source_urls") into sources_raw,
then build sources = [str(s) for s in sources_raw if s is not None and
(isinstance(s, str) or is_scalar_coercible(s))] or similar, ensuring only valid
string items are passed to run_comprehensive_evaluation.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/d4bl/evals/runner.py`:
- Around line 52-56: The list comprehension that builds "extracted_contents"
uses f.get("url", "") and f.get("content", "") which may return non-string
types; update the comprehension to coerce both values to strings (and normalize
None to empty string) before adding them so the resulting entries always match
the expected string signature. Locate the comprehension that produces
"extracted_contents" (using variables findings and f.get("url"/"content")) and
replace the direct f.get calls with a small in-line normalization such as url =
f.get("url", "") and content = f.get("content", ""), then set url = "" if url is
None else str(url) and content = "" if content is None else str(content) (or
equivalent) so every {"url": ..., "content": ...} entry contains strings.
- Line 92: The code uses db_module.async_session_maker without ensuring it was
initialized; guard the usage in the function containing "async with
db_module.async_session_maker() as db:" by checking that
db_module.async_session_maker is not None and is callable (or initialize it)
before calling; if it's missing, raise a clear RuntimeError (or call await
db_module.init_db() if initialization is safe here) with a message referencing
init_db and async_session_maker so callers know to initialize the DB first.
- Around line 38-39: The sources list may include non-string items from
research_data.get("source_urls"); update the extraction around
sources_raw/sources in run_comprehensive_evaluation to produce a List[str] by
filtering out non-string entries and coercing allowed types (e.g., numbers) to
strings: read research_data.get("source_urls") into sources_raw, then build
sources = [str(s) for s in sources_raw if s is not None and (isinstance(s, str)
or is_scalar_coercible(s))] or similar, ensuring only valid string items are
passed to run_comprehensive_evaluation.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7815f5c7-fc24-4cc1-a86b-9d5947d0d816

📥 Commits

Reviewing files that changed from the base of the PR and between 5ec8f20 and d2d9ea2.

📒 Files selected for processing (1)

src/d4bl/evals/runner.py

William-Hill added 3 commits March 4, 2026 13:49

docs: add plan for simplifying observability module

71d54c9

Closes #25 — addresses 11 findings across code reuse, quality, and efficiency in src/d4bl/observability/.

docs: add ongoing simplification metrics report

111f05c

Tracks LOC, cyclomatic complexity, and maintainability index before/after each module simplification PR. Includes per-module baselines for remaining work and "how to regenerate" instructions.

coderabbitai Bot requested changes Mar 5, 2026

View reviewed changes

Comment thread docs/SIMPLIFICATION_REPORT.md

Comment thread docs/SIMPLIFICATION_REPORT.md Outdated

Comment thread src/d4bl/evals/runner.py Outdated

coderabbitai Bot requested changes Mar 5, 2026

View reviewed changes

fix: harden JSON shape checks in _extract_eval_inputs

5ec8f20

Validate that research_data, source_urls, research_findings, and report have expected types before accessing dict/list methods, preventing AttributeError on nonconforming JSON rows.

coderabbitai Bot requested changes Mar 5, 2026

View reviewed changes

Comment thread src/d4bl/evals/runner.py Outdated

Comment thread src/d4bl/evals/runner.py

fix: use module-level access for async_session_maker, validate concur…

d2d9ea2

…rency - Access async_session_maker via db_module to avoid stale None binding from import-time snapshot (async_session_maker is assigned in init_db) - Reject concurrency < 1 to prevent Semaphore(0) deadlock

coderabbitai Bot reviewed Mar 5, 2026

View reviewed changes

William-Hill merged commit 59fc487 into main Mar 5, 2026

William-Hill deleted the simplify/evals branch March 5, 2026 02:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify: evals/ module#42

Simplify: evals/ module#42
William-Hill merged 6 commits into
mainfrom
simplify/evals

William-Hill commented Mar 5, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Mar 5, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Mar 5, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

William-Hill commented Mar 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Complexity Metrics

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

William-Hill commented Mar 5, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 5, 2026 •

edited

Loading