Simplify: evals/ module #42
Closes #25 — addresses 11 findings across code reuse, quality, and efficiency in src/d4bl/observability/.
Tracks LOC, cyclomatic complexity, and maintainability index before/after each module simplification PR. Includes per-module baselines for remaining work and "how to regenerate" instructions.
- Remove dead fallback keys (answer, text, output) from result extraction
- Pass actual source_urls, report, and extracted_contents to evaluations
- Remove unused parameters (eval_types, interactive, output_csv_path)
- Fix mixed imports, remove unused Path import, use modern type hints
- Filter NULL results in SQL, add ORDER BY for deterministic LIMIT
- Keep session open during asyncio.gather to prevent detached ORM objects
- Add full traceback logging for evaluation failures
📝 Walkthrough
Refactors batch evaluation flow: removes three CLI options, centralizes job result extraction via …
Sequence Diagram(s)
sequenceDiagram
participant Main as Main/CLI
participant Runner as run_evals_and_log()
participant DB as Database
participant Jobs as ResearchJob list
participant Extract as _extract_eval_inputs()
participant ThreadPool as ThreadPoolExecutor
participant Eval as run_comprehensive_evaluation()
Main->>Runner: call(max_rows, concurrency, selected_job_ids)
Runner->>DB: query completed jobs where result IS NOT NULL,<br/>filter by selected_job_ids, ORDER BY created_at DESC, LIMIT max_rows
DB-->>Jobs: return ResearchJob instances
Runner->>Jobs: iterate and spawn tasks (asyncio.gather)
loop for each job (concurrent)
Jobs->>Extract: _extract_eval_inputs(job)
Extract-->>Runner: inputs dict or None
alt inputs present
Runner->>ThreadPool: submit run_comprehensive_evaluation(inputs)
ThreadPool->>Eval: execute evaluation
Eval-->>ThreadPool: evaluation result
ThreadPool-->>Runner: result (includes trace_id)
Runner->>Main: log/store evaluation result
else inputs missing
Runner->>Main: log warning (no usable result)
end
end
Runner-->>Main: complete (errors logged with tracebacks if any)
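The flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the project's `runner.py`: `FakeJob`, `extract_inputs`, `evaluate`, and `run_all` are hypothetical stand-ins for `ResearchJob`, `_extract_eval_inputs()`, `run_comprehensive_evaluation()`, and `run_evals_and_log()`, and the database query is replaced by an in-memory list.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeJob:
    """Hypothetical stand-in for a ResearchJob row."""
    job_id: int
    query: str
    result: Optional[dict]


def extract_inputs(job: FakeJob) -> Optional[dict]:
    """Return evaluator kwargs, or None when the job has no usable output."""
    raw = job.result.get("raw_output", "") if isinstance(job.result, dict) else ""
    text = str(raw).strip()
    if not text:
        return None
    return {"query": job.query, "research_output": text, "trace_id": str(job.job_id)}


def evaluate(inputs: dict) -> dict:
    """Hypothetical blocking evaluation; the real evaluator does far more."""
    return {"trace_id": inputs["trace_id"], "length": len(inputs["research_output"])}


async def run_all(jobs: list, concurrency: int = 2) -> list:
    if concurrency < 1:
        raise ValueError("concurrency must be >= 1")
    sem = asyncio.Semaphore(concurrency)  # caps evaluations in flight
    loop = asyncio.get_running_loop()

    with ThreadPoolExecutor(max_workers=concurrency) as pool:

        async def run_one(job: FakeJob) -> Optional[dict]:
            inputs = extract_inputs(job)
            if inputs is None:
                return None  # the diagram's "log warning" branch
            async with sem:
                # Blocking work runs in a worker thread, off the event loop.
                return await loop.run_in_executor(pool, evaluate, inputs)

        # gather preserves input order, so results line up with jobs.
        return await asyncio.gather(*(run_one(j) for j in jobs))


jobs = [FakeJob(1, "q1", {"raw_output": " some text "}), FakeJob(2, "q2", None)]
results = asyncio.run(run_all(jobs))
```

Here the second job has no result, so it takes the "inputs missing" branch and yields `None` instead of reaching the thread pool.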
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/SIMPLIFICATION_REPORT.md`:
- Around line 164-166: The script at the end uses `count` and `total_cc` to
print an average but can divide by zero if `count` is 0; update the final
printing logic to check `count == 0` before computing `total_cc / count` (the
`count` and `total_cc` variables and the two print statements are the relevant
symbols), and when `count` is zero print a sensible message or show 0.0 for the
average instead of performing the division to avoid a ZeroDivisionError.
- Around line 31-35: The tables under the "Completed Modules" section use a
4-column header ("| Metric | Before | After | Delta |") but data rows lack
values for Before/After/Delta; update each affected table row to either supply
concrete Before, After, and Delta values for every Metric row (e.g., fill
numeric LOC or % deltas) or simplify the table to a two-column format ("Metric |
Notes") and adjust the header and all rows accordingly; ensure every table under
"Completed Modules" uses the same consistent column count as the header
(reference the header string "| Metric | Before | After | Delta |" and the
metric rows currently under that section) so Markdown renders correctly.
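The division-by-zero guard requested in the first item above might look like the sketch below. Only `count` and `total_cc` come from the review comment; the helper name `average_complexity` is hypothetical, and the report's actual script is not shown on this page.

```python
def average_complexity(total_cc: float, count: int) -> float:
    """Return mean cyclomatic complexity, or 0.0 when nothing was counted."""
    if count == 0:
        # Nothing to average: avoid ZeroDivisionError and report 0.0.
        return 0.0
    return total_cc / count


# Hypothetical totals, as the script would accumulate them.
total_cc, count = 12, 4
print(f"Blocks: {count}")
print(f"Average CC: {average_complexity(total_cc, count):.1f}")
```

Returning `0.0` keeps the output format stable; printing a message such as "no blocks found" is the other option the comment allows.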
In `@src/d4bl/evals/runner.py`:
- Line 24: The assignment to research_output in runner.py is over the 100-char
limit; split the expression across multiple lines to keep each line ≤100 chars
by extracting the conditional check into a short temporary or by using a
two-step assignment: first compute a safe_raw = raw_result.get("raw_output", "")
if isinstance(raw_result, dict) else "" and then set research_output =
str(safe_raw).strip(); reference the research_output variable and raw_result
check in your edit.
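The two-step assignment described above can be sketched as a small function. The wrapper name `extract_research_output` is hypothetical; `raw_result`, `safe_raw`, and `research_output` follow the names used in the comment.

```python
def extract_research_output(raw_result: object) -> str:
    """Pull raw_output from a dict result; anything else yields an empty string."""
    # Step 1: read the field only when the result really is a dict.
    safe_raw = raw_result.get("raw_output", "") if isinstance(raw_result, dict) else ""
    # Step 2: coerce to str and trim, on its own short line.
    research_output = str(safe_raw).strip()
    return research_output
```

Both statements stay under 100 characters, and the behavior matches the single-line form.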
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 08104100-3ed9-4baf-8fdd-76b4fbc5cdeb
📒 Files selected for processing (4)
docs/SIMPLIFICATION_REPORT.md
docs/plans/2026-03-04-simplify-observability.md
scripts/run_evals.py
src/d4bl/evals/runner.py
- Resolve merge conflict in SIMPLIFICATION_REPORT.md (take main's observability metrics)
- Fix table column mismatch: move "Key changes" rows out of tables into text
- Add zero-check for division in example script
- Break long line in runner.py to stay within 100-char limit
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/d4bl/evals/runner.py`:
- Around line 32-45: Normalize and validate research_data and findings before
building evaluator kwargs: ensure job.research_data is a dict (otherwise set
research_data = {}) and that research_findings is a list (otherwise set findings
= []), then use these sanitized variables when populating "sources" and
"extracted_contents"; also when constructing extracted_contents ensure each item
f is a dict and safely read f.get("url","") and f.get("content","") and skip
non-dict entries so the list comprehension in extracted_contents cannot raise
AttributeError/TypeError (refer to symbols research_data, findings,
job.research_data, research_findings, and extracted_contents).
📒 Files selected for processing (2)
docs/SIMPLIFICATION_REPORT.md
src/d4bl/evals/runner.py
    research_data = job.research_data or {}
    findings = research_data.get("research_findings") or []

    return {
        "query": job.query,
        "research_output": research_output,
        "sources": research_data.get("source_urls", []),
        "trace_id": job.trace_id or str(job.job_id),
        "report": raw_result.get("report") if isinstance(raw_result, dict) else None,
        "extracted_contents": [
            {"url": f.get("url", ""), "content": f.get("content", "")}
            for f in findings
            if isinstance(f, dict) and "url" in f
        ] or None,
Harden JSON shape handling before building evaluator kwargs.
Line 32 and Line 41 assume research_data/research_findings have dict/list shapes. If a row contains nonconforming JSON, this can raise AttributeError/TypeError and fail that job evaluation path.
🛠️ Proposed fix
    -    research_data = job.research_data or {}
    -    findings = research_data.get("research_findings") or []
    +    research_data = job.research_data if isinstance(job.research_data, dict) else {}
    +    source_urls_raw = research_data.get("source_urls")
    +    source_urls = [u for u in source_urls_raw if isinstance(u, str)] if isinstance(
    +        source_urls_raw, list
    +    ) else []
    +    findings_raw = research_data.get("research_findings")
    +    findings = findings_raw if isinstance(findings_raw, list) else []
    +    report_raw = raw_result.get("report") if isinstance(raw_result, dict) else None
    +    report = report_raw if isinstance(report_raw, str) else None
         return {
             "query": job.query,
             "research_output": research_output,
    -        "sources": research_data.get("source_urls", []),
    +        "sources": source_urls,
             "trace_id": job.trace_id or str(job.job_id),
    -        "report": raw_result.get("report") if isinstance(raw_result, dict) else None,
    +        "report": report,
             "extracted_contents": [
    -            {"url": f.get("url", ""), "content": f.get("content", "")}
    +            {"url": str(f.get("url", "")), "content": str(f.get("content", ""))}
                 for f in findings
    -            if isinstance(f, dict) and "url" in f
    +            if isinstance(f, dict) and f.get("url")
             ] or None,
         }
Validate that research_data, source_urls, research_findings, and report have expected types before accessing dict/list methods, preventing AttributeError on nonconforming JSON rows.
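To see how this validation behaves on a malformed row, the proposed fix can be exercised in isolation. The sketch below is an assumption-laden illustration: `build_eval_kwargs` is a hypothetical name, `raw_result` is passed in explicitly, and a `SimpleNamespace` stands in for a `ResearchJob` row.

```python
from types import SimpleNamespace
from typing import Any


def build_eval_kwargs(job: Any, raw_result: Any, research_output: str) -> dict:
    """Build evaluator kwargs, tolerating JSON columns with unexpected shapes."""
    research_data = job.research_data if isinstance(job.research_data, dict) else {}

    urls_raw = research_data.get("source_urls")
    sources = [u for u in urls_raw if isinstance(u, str)] if isinstance(urls_raw, list) else []

    findings_raw = research_data.get("research_findings")
    findings = findings_raw if isinstance(findings_raw, list) else []

    report_raw = raw_result.get("report") if isinstance(raw_result, dict) else None
    report = report_raw if isinstance(report_raw, str) else None

    extracted = [
        {"url": str(f.get("url", "")), "content": str(f.get("content", ""))}
        for f in findings
        if isinstance(f, dict) and f.get("url")  # skip non-dicts and empty URLs
    ]
    return {
        "query": job.query,
        "research_output": research_output,
        "sources": sources,
        "trace_id": job.trace_id or str(job.job_id),
        "report": report,
        "extracted_contents": extracted or None,
    }


# A deliberately malformed row: mixed-type URLs, a non-dict finding, a list result.
bad_job = SimpleNamespace(
    query="q",
    trace_id=None,
    job_id=7,
    research_data={
        "source_urls": ["https://a.example", 3, None],
        "research_findings": ["oops", {"url": "https://a.example", "content": 5}],
    },
)
kwargs = build_eval_kwargs(bad_job, ["not", "a", "dict"], "text")
```

With this input the non-string URLs and the string finding are dropped, `report` falls back to `None`, and the numeric `content` is coerced to `"5"` rather than raising.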
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/d4bl/evals/runner.py`:
- Line 73: Validate the concurrency parameter before creating the semaphore:
check the variable concurrency and if it is not a positive integer (<= 0) raise
a clear ValueError (or coerce to a sensible minimum like 1) before executing sem
= asyncio.Semaphore(concurrency); this guard should be placed immediately before
the Semaphore creation so the code paths using sem (referencing sem and
asyncio.Semaphore) never get a zero-capacity semaphore.
- Line 11: The file binds async_session_maker at import time which becomes stale
after init_db() reassigns it; change the import to use the database module
(e.g., import d4bl.infra.database as database) and update any call sites in this
file that call async_session_maker() to use database.async_session_maker() (and
optionally reference ResearchJob via database.ResearchJob if you want
consistency) so the runtime uses the updated binding set by init_db().
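The stale-binding problem in the second item is general Python import behavior and can be reproduced without the project. In the sketch below, `fake_db` is a hypothetical in-memory module standing in for `d4bl.infra.database`; only the names `async_session_maker` and `init_db` come from the review comment.

```python
import sys
import types

# Hypothetical stand-in module whose session maker starts out unset.
fake_db = types.ModuleType("fake_db")
fake_db.async_session_maker = None


def init_db() -> None:
    """Rebind the module attribute, as the comment says init_db() does."""
    fake_db.async_session_maker = lambda: "session"


fake_db.init_db = init_db
sys.modules["fake_db"] = fake_db

# "from ... import name" copies the current value (None) into a local name.
from fake_db import async_session_maker as snapshot

# "import module" keeps a reference to the module, so lookups happen at call time.
import fake_db as database

database.init_db()

print(snapshot)                        # still None: the import-time snapshot
print(database.async_session_maker())  # "session": resolved after init_db()
```

This is why the comment suggests calling `database.async_session_maker()` through the module rather than importing the name directly.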
📒 Files selected for processing (1)
src/d4bl/evals/runner.py
…rency
- Access async_session_maker via db_module to avoid stale None binding from import-time snapshot (async_session_maker is assigned in init_db)
- Reject concurrency < 1 to prevent Semaphore(0) deadlock
♻️ Duplicate comments (3)
src/d4bl/evals/runner.py (3)
52-56: ⚠️ Potential issue | 🟡 Minor
Type safety: `url` and `content` values may not be strings.
The `f.get("url", "")` and `f.get("content", "")` calls return whatever type is stored in the dict. Per the expected signature, `extracted_contents` should contain string values.
🛠️ Proposed fix
         "extracted_contents": [
    -        {"url": f.get("url", ""), "content": f.get("content", "")}
    +        {"url": str(f.get("url", "")), "content": str(f.get("content", ""))}
             for f in findings
             if isinstance(f, dict) and "url" in f
         ] or None,
92-92: ⚠️ Potential issue | 🟡 Minor
Guard against uninitialized `async_session_maker`.
If `init_db()` fails silently or is called in a context where initialization doesn't complete, `db_module.async_session_maker` remains `None`, causing a `TypeError` when called.
🛠️ Proposed fix
         init_db()
    +    if db_module.async_session_maker is None:
    +        raise RuntimeError("Database session maker not initialized")
         if concurrency < 1:
             raise ValueError("concurrency must be >= 1")
38-39: ⚠️ Potential issue | 🟡 Minor
Type safety: `sources` list may contain non-string elements.
`run_comprehensive_evaluation` expects `sources: List[str]`, but `source_urls` from the DB may contain non-string items. Consider filtering or coercing to strings.
🛠️ Proposed fix
         sources_raw = research_data.get("source_urls")
    -    sources = sources_raw if isinstance(sources_raw, list) else []
    +    sources = [
    +        str(s) for s in sources_raw if s is not None
    +    ] if isinstance(sources_raw, list) else []
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/d4bl/evals/runner.py`:
- Around line 52-56: The list comprehension that builds "extracted_contents"
uses f.get("url", "") and f.get("content", "") which may return non-string
types; update the comprehension to coerce both values to strings (and normalize
None to empty string) before adding them so the resulting entries always match
the expected string signature. Locate the comprehension that produces
"extracted_contents" (using variables findings and f.get("url"/"content")) and
replace the direct f.get calls with a small in-line normalization such as url =
f.get("url", "") and content = f.get("content", ""), then set url = "" if url is
None else str(url) and content = "" if content is None else str(content) (or
equivalent) so every {"url": ..., "content": ...} entry contains strings.
- Line 92: The code uses db_module.async_session_maker without ensuring it was
initialized; guard the usage in the function containing "async with
db_module.async_session_maker() as db:" by checking that
db_module.async_session_maker is not None and is callable (or initialize it)
before calling; if it's missing, raise a clear RuntimeError (or call await
db_module.init_db() if initialization is safe here) with a message referencing
init_db and async_session_maker so callers know to initialize the DB first.
- Around line 38-39: The sources list may include non-string items from
research_data.get("source_urls"); update the extraction around
sources_raw/sources in run_comprehensive_evaluation to produce a List[str] by
filtering out non-string entries and coercing allowed types (e.g., numbers) to
strings: read research_data.get("source_urls") into sources_raw, then build
sources = [str(s) for s in sources_raw if s is not None and (isinstance(s, str)
or is_scalar_coercible(s))] or similar, ensuring only valid string items are
passed to run_comprehensive_evaluation.
📒 Files selected for processing (1)
src/d4bl/evals/runner.py
Summary
Closes #26
- Pass actual `source_urls`, `report`, and `extracted_contents` to evaluations; batch evals now have parity with the live research path
- Remove unused parameters (`eval_types`, `interactive`, `output_csv_path`) from runner + CLI; fix mixed imports; use modern type hints
- Filter NULL results in SQL and add `ORDER BY created_at DESC` for deterministic `LIMIT`; keep session open during `asyncio.gather` to prevent detached ORM objects; add full traceback logging for failures
Complexity Metrics
LOC and complexity increased because the module now does meaningful work it was previously skipping (passing real sources, report, and extracted contents to evaluators).
Test plan
- `python -c "from d4bl.evals.runner import run_evals_and_log"` imports cleanly
- `python scripts/run_evals.py --help` shows updated CLI (no `--eval-types`, `--interactive`, `--output-csv`)
- `python scripts/run_evals.py --max-rows 1` against a DB with completed jobs to confirm evaluations execute with real sources/report