refactor: Simplify services/ module — dedup, enums, parallel evals by William-Hill · Pull Request #32 · William-Hill/d4bl_ai_agent

William-Hill · 2026-03-03T14:38:57Z

Summary

Continues issue #19 (services/ module simplification) with Phases 3-5, completing 8 more findings (17/20 total).

Phase 3 — Safe/Quick wins:

1.3: Extract duplicated _keyword_relevance to parsers.keyword_relevance() (was in 3 files)
1.4: Verified JSON parse dedup is complete — no action needed
2.7: Remove evals.py compatibility shim, update import in research_runner.py
3.7: Centralize Docker detection via Settings.is_docker (replaces os.path.exists("/.dockerenv") in client.py + observability/langfuse.py)
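
As a rough illustration of finding 3.7, the centralized check could look something like the sketch below. Only `Settings.is_docker`, `get_settings()`, and the `/.dockerenv` probe come from this PR; the `dockerenv_path` field, the property form, and the caching are assumptions made to keep the example self-contained.

```python
import os
from dataclasses import dataclass
from functools import lru_cache

# Marker file present inside Docker containers; the check this PR centralizes.
DOCKERENV_PATH = "/.dockerenv"


@dataclass(frozen=True)
class Settings:
    """Illustrative stand-in for the project's Settings dataclass."""

    # Hypothetical field so the path can be overridden in tests.
    dockerenv_path: str = DOCKERENV_PATH

    @property
    def is_docker(self) -> bool:
        # Single place that answers "are we running in a container?"
        return os.path.exists(self.dockerenv_path)


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    """Return one shared Settings instance per process (assumed caching)."""
    return Settings()
```

Callers such as client.py can then ask `get_settings().is_docker` instead of touching the filesystem themselves.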

Phase 4 — Low risk:

2.3: Add EvalStatus(str, Enum) replacing stringly-typed status values across all evaluators and runner
2.6: Extract notify_progress() helper deduplicating 3 identical set_status + send_websocket_update blocks (~30 LOC saved)
1.1: Docker detection now shared; singletons kept separate (different architectural purposes)
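
For finding 2.6, the shape of the helper is roughly a closure over the job ID. The sketch below is not the code in research_runner.py: `run_job`, the callables passed in, and the stage names are hypothetical, and the real helper may be async.

```python
from typing import Any, Callable


def run_job(
    job_id: str,
    set_status: Callable[[str, str], None],
    send_websocket_update: Callable[[str, dict[str, Any]], None],
) -> None:
    """Hypothetical job body; only the nested helper pattern matters here."""

    def notify_progress(stage: str, message: str) -> None:
        # One place for the status write plus the websocket push,
        # instead of repeating both calls at every stage.
        set_status(job_id, stage)
        send_websocket_update(job_id, {"stage": stage, "message": message})

    notify_progress("researching", "Gathering sources")
    notify_progress("evaluating", "Running evaluations")
    notify_progress("completed", "Done")
```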

Phase 5 — Parallel evaluations:

3.1: Replace sequential evaluation loop with ThreadPoolExecutor — all evaluations (including optional content/report relevance) run concurrently
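
A minimal sketch of the data-driven pattern behind finding 3.1, assuming each spec is a `(name, func, kwargs)` tuple. `_run_eval` and `eval_specs` are names from this PR; `run_all`, the plain-string statuses, and the worker count are illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable

EvalSpec = tuple[str, Callable[..., dict[str, Any]], dict[str, Any]]


def _run_eval(name: str, func: Callable[..., dict[str, Any]], kwargs: dict[str, Any]):
    """Run one evaluator and convert any exception into a failed result."""
    try:
        return name, func(**kwargs)
    except Exception as exc:
        return name, {"status": "failed", "error": str(exc)}


def run_all(eval_specs: list[EvalSpec]) -> dict[str, dict[str, Any]]:
    """Submit every spec to a thread pool and collect results by name."""
    results: dict[str, dict[str, Any]] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(eval_specs))) as executor:
        futures = [executor.submit(_run_eval, n, f, kw) for n, f, kw in eval_specs]
        for future in as_completed(futures):
            name, result = future.result()
            results[name] = result
    return results
```

Because `_run_eval` never raises, one failing evaluator cannot prevent the others from being recorded.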

Intentionally deferred (3 findings):

1.2: Retry reuse — call_llm_text uses linear delay vs retry_with_backoff's exponential; different semantics, low ROI
2.5: Parameter sprawl in update_job_status — single caller, params mirror DB columns
3.4: Parallel URL evals within content_relevance.py — marginal gain within single evaluator
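
To make the "different semantics" in 1.2 concrete, here is how the two delay schedules diverge. The formulas are assumptions for illustration (a fixed step versus doubling); the actual `call_llm_text` and `retry_with_backoff` implementations may compute their delays differently.

```python
def linear_delays(base: float, attempts: int) -> list[float]:
    """Delay grows by a fixed step: base, 2*base, 3*base, ..."""
    return [base * (i + 1) for i in range(attempts)]


def exponential_delays(base: float, attempts: int) -> list[float]:
    """Delay doubles on each attempt: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(attempts)]
```

With a 1-second base and four attempts, the linear schedule waits 10 seconds in total and the exponential one waits 15, so swapping one for the other changes worst-case latency.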

Test plan

All 52 tests pass after each phase
EvalStatus extends str so existing string comparisons remain backward-compatible
Manual verification of parallel evaluations with live Ollama (thread safety of LLM HTTP calls)
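
The backward-compatibility point can be seen in a small sketch. The member names match this PR; the lowercase string values are an assumption about what the old stringly-typed statuses looked like.

```python
import json
from enum import Enum


class EvalStatus(str, Enum):
    """Outcome of a single evaluation or of the whole batch."""

    SUCCESS = "success"                  # values are assumed, names are from the PR
    FAILED = "failed"
    SKIPPED = "skipped"
    PARTIAL_SUCCESS = "partial_success"


# Existing comparisons against plain strings keep working.
legacy_check = EvalStatus.SUCCESS == "success"

# Members also serialize as plain strings in JSON payloads.
payload = json.dumps({"status": EvalStatus.FAILED})
```

One caveat: `str(EvalStatus.SUCCESS)` yields the member name rather than the value, so `.value` is the safer choice when building log messages.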

Closes #19

Summary by CodeRabbit

New Features
- Parallelized evaluations for faster, concurrent processing.
- Runtime Docker detection via settings to auto-adjust behavior.
Improvements
- Standardized evaluation statuses with a shared enum (SUCCESS, FAILED, SKIPPED, PARTIAL_SUCCESS).
- Centralized keyword relevance scoring via a shared parser utility.
Changes
- Consolidated progress notifications for clearer job-stage reporting.

Phase 3-5 of issue #19 code simplification:
- Extract duplicated _keyword_relevance to parsers.keyword_relevance()
- Remove evals.py compatibility shim (update import in research_runner)
- Centralize Docker detection via Settings.is_docker
- Add EvalStatus(str, Enum) replacing stringly-typed status values
- Parallelize evaluations with ThreadPoolExecutor (finding 3.1)
- Extract notify_progress() helper deduplicating 3 identical blocks
- Verify JSON parse dedup is complete (finding 1.4 — no action needed)

Intentionally deferred: 1.2 (retry reuse — different semantics), 2.5 (param sprawl — single caller), 3.4 (parallel URL evals — marginal)

17/20 findings now addressed for services/ module.

coderabbitai · 2026-03-03T14:39:14Z

Walkthrough

Centralizes Docker detection in Settings.is_docker, introduces an EvalStatus enum and replaces string statuses across langfuse evaluators, adds a keyword_relevance parser, parallelizes evaluations in the runner with ThreadPoolExecutor, removes a redundant evals shim, and centralizes progress notifications in research_runner.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Settings & Observability: `src/d4bl/settings.py`, `src/d4bl/observability/langfuse.py` | Adds `Settings.is_docker` and switches Docker checks to use it (centralizes environment detection). |
| Langfuse Client: `src/d4bl/services/langfuse/client.py` | Replaces filesystem Docker check (`/.dockerenv`) with `get_settings().is_docker` when adjusting host for Docker environments. |
| EvalStatus Enum & Export: `src/d4bl/services/langfuse/_base.py`, `src/d4bl/services/langfuse/__init__.py` | Adds `EvalStatus` (str Enum: `SUCCESS`, `FAILED`, `SKIPPED`, `PARTIAL_SUCCESS`) and exports it from the package. |
| Evaluator Status Standardization: `src/d4bl/services/langfuse/quality.py`, `.../bias.py`, `.../hallucination.py`, `.../reference.py`, `.../content_relevance.py`, `.../source_relevance.py`, `.../report_relevance.py` | Replaces string status literals with `EvalStatus` enum values in return payloads and error paths across evaluators. |
| Parsers: keyword_relevance: `src/d4bl/services/langfuse/parsers.py` | Adds `keyword_relevance(query, text) -> float` to consolidate keyword-overlap scoring; it is used as the fallback in relevance evaluators. |
| Runner: Parallelization & Status Aggregation: `src/d4bl/services/langfuse/runner.py` | Refactors sequential evaluator execution into data-driven parallel execution using `ThreadPoolExecutor`, adds `_run_eval`, unified `eval_specs`, timeout handling, and aggregates overall status using `EvalStatus`. |
| Compatibility shim removal: `src/d4bl/services/langfuse/evals.py` | Removes the re-export shim file that duplicated the public API (deleted). |
| Research runner progress updates: `src/d4bl/services/research_runner.py` | Introduces nested `notify_progress` helper to consolidate DB status + websocket updates and updates import to call runner's comprehensive evaluation. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Runner as runner.py
    participant Executor as ThreadPoolExecutor
    participant Eval as Evaluator
    participant Langfuse as LangfuseClient

    Client->>Runner: run_comprehensive_evaluation(...)
    Runner->>Runner: build eval_specs (name, func, kwargs)
    Runner->>Executor: submit(_run_eval) for each spec
    par Parallel evaluations
        Executor->>Eval: execute evaluator function
        Eval->>Langfuse: call LLM / Langfuse client
        Langfuse-->>Eval: return evaluation result
        Eval-->>Executor: return result dict (with EvalStatus)
    end
    Executor->>Runner: collect results / handle exceptions / timeouts
    Runner->>Runner: aggregate EvalStatus -> overall status
    Runner-->>Client: return aggregated results with EvalStatus
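
The final aggregation step in the diagram could be sketched as follows. The rule shown (skipped evaluations ignored, any mix of outcomes reported as partial success) is an assumption for illustration and not necessarily the logic in runner.py; `aggregate_status` is a hypothetical name, and the enum values are assumed.

```python
from enum import Enum


class EvalStatus(str, Enum):
    SUCCESS = "success"
    FAILED = "failed"
    SKIPPED = "skipped"
    PARTIAL_SUCCESS = "partial_success"


def aggregate_status(evaluations: dict[str, dict]) -> EvalStatus:
    """Collapse per-evaluator statuses into one overall status."""
    # Assumption: skipped evaluations count neither for nor against the batch.
    ran = [
        r.get("status")
        for r in evaluations.values()
        if r.get("status") != EvalStatus.SKIPPED
    ]
    if not ran:
        return EvalStatus.SKIPPED
    successes = sum(1 for s in ran if s == EvalStatus.SUCCESS)
    if successes == len(ran):
        return EvalStatus.SUCCESS
    if successes == 0:
        return EvalStatus.FAILED
    return EvalStatus.PARTIAL_SUCCESS
```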

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Epic: Code Simplification Sweep #29: Implements centralized Settings.is_docker and replaces string status values with an EvalStatus enum, addressing the issue's Docker-detection and status-typing objectives.

Possibly related PRs

feat: Add Explore Data page with interactive dashboard #15: Modifies Langfuse client initialization; related because both PRs change client.py Docker/host handling.
refactor: Simplify services/ module — base evaluator, dead code cleanup #30: Refactors evaluator helpers/parsers and runner logic; related due to overlapping changes in _base.py, parsers.py, and runner/evaluator consolidation.
refactor: Iteration 3 — Observability & hardening #14: Changes Settings dataclass; related because both PRs modify src/d4bl/settings.py (adding configuration fields).

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 35.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: deduplication, enum introduction, and parallelization of evaluations. |
| Linked Issues check | ✅ Passed | All primary objectives from issue `#19` are addressed: deduplication via keyword_relevance extraction, enum introduction via EvalStatus, parallel evaluations via ThreadPoolExecutor, Docker detection centralization, and evals.py removal. |
| Out of Scope Changes check | ✅ Passed | All changes align with issue `#19` objectives; no out-of-scope modifications detected beyond the intended code simplification and refactoring scope. |


coderabbitai

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/d4bl/services/langfuse/_base.py`:
- Around line 12-17: Add a concise docstring to the public EvalStatus enum
describing its purpose and the meaning of each member (SUCCESS, FAILED, SKIPPED,
PARTIAL_SUCCESS) so users of the API understand what the status values
represent; modify the EvalStatus class definition to include this explanatory
string literal immediately under the class declaration.

In `@src/d4bl/services/langfuse/bias.py`:
- Around line 17-20: The return statements for validation of research_output and
query are too long; refactor the long dict literal returned when research_output
or query is empty (in the validation block referencing variables
research_output, query and the EvalStatus enum) so it fits under 100 chars per
line — e.g., build the response dict in multiple lines or assign it to a local
variable (e.g., response = {...} with each key on its own line) and then return
that variable for both checks to keep lines short and avoid exceeding the
100-character limit.

In `@src/d4bl/services/langfuse/client.py`:
- Around line 39-40: Move the inline import of get_settings out of the function
and place it at module level: remove the local "from d4bl.settings import
get_settings" inside the code that checks "if get_settings().is_docker and
'localhost' in langfuse_host" and instead import get_settings once at the top of
src/d4bl/services/langfuse/client.py; update any references to get_settings() in
that module to use the module-level import so the check using
get_settings().is_docker and langfuse_host remains unchanged but avoids repeated
inline imports.

In `@src/d4bl/services/langfuse/hallucination.py`:
- Around line 24-29: The three validation return statements for query, answer,
and context in hallucination.py exceed the 100-char line limit; split each
return dict across multiple lines so keys and values are on their own lines
(e.g., break the return {"error": ..., "status": EvalStatus.FAILED,
"error_type": "validation"} into a multi-line dict) for the checks that use if
not query or not query.strip(), if not answer or not answer.strip(), and if not
context or not context.strip(), keeping EvalStatus.FAILED and "error_type":
"validation" intact.

In `@src/d4bl/services/langfuse/parsers.py`:
- Around line 8-15: Update keyword_relevance: replace the EN DASH in the
docstring with a normal hyphen-minus and move the empty-query check before
computing matches to avoid confusing flow and potential division by zero;
specifically, in function keyword_relevance ensure you return a default score
(e.g., 3.0) if query_words is empty before computing matches, then compute
matches and the final scaled score using query_words and text_lower as currently
done.

In `@src/d4bl/services/langfuse/reference.py`:
- Around line 25-30: The validation return statements for the
query/answer/context checks exceed the 100-character line limit; refactor each
conditional (the checks that use EvalStatus and return the error dict) to break
the returned dict across multiple lines or assign the error message and the dict
to short-named variables before returning so no single line exceeds 100 chars
(e.g., split the dict keys/value pairs onto separate lines or use a small helper
like _validation_error(status, error_type, message) and call it from the three
places). Ensure you still return the same keys ("error", "status", "error_type")
and use EvalStatus.FAILED.
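
One way the suggested helper could look, as a sketch only. `_validation_error` is the name floated in the comment above; `validate_inputs`, the error messages, and the plain-string `FAILED` stand-in for `EvalStatus.FAILED` are hypothetical.

```python
from typing import Any, Optional

FAILED = "failed"  # stand-in for EvalStatus.FAILED to keep the sketch self-contained


def _validation_error(message: str) -> dict[str, Any]:
    """Build the error payload once so call sites stay short."""
    return {
        "error": message,
        "status": FAILED,
        "error_type": "validation",
    }


def validate_inputs(query: str, answer: str, context: str) -> Optional[dict[str, Any]]:
    """Return an error payload for the first empty input, or None if all are set."""
    for label, value in (("query", query), ("answer", answer), ("context", context)):
        if not value or not value.strip():
            return _validation_error(f"{label} must be a non-empty string")
    return None
```

Each evaluator can then start with `error = validate_inputs(...)` and return it if set, which keeps every line well under the 100-character limit.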

In `@src/d4bl/services/langfuse/runner.py`:
- Around line 152-160: The parallel evaluation loop currently waits indefinitely
for futures; wrap as_completed or future.result with a timeout and handle
TimeoutError: use as_completed(futures, timeout=...) or call
future.result(timeout=...) inside the loop, catch
concurrent.futures.TimeoutError, then cancel outstanding futures (call
future.cancel() for any unfinished futures in the futures dict), record a
timeout failure into results["evaluations"][name] (or a generic timeout entry)
and log the timeout; update the block that creates the ThreadPoolExecutor,
futures, and the loop over as_completed so that _run_eval calls are bounded and
hung evaluators don't block the whole job.
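
A sketch of the bounded wait described above, using `as_completed(..., timeout=...)`. `run_with_timeout` and the result payloads are hypothetical. Note that `future.cancel()` only prevents tasks that have not started; a worker that is already running keeps going in the background.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError, as_completed
from typing import Any, Callable


def run_with_timeout(
    tasks: dict[str, Callable[[], dict[str, Any]]],
    timeout_s: float,
) -> dict[str, dict[str, Any]]:
    """Run zero-argument tasks in parallel, bounding the whole batch by timeout_s."""
    results: dict[str, dict[str, Any]] = {}
    executor = ThreadPoolExecutor(max_workers=max(1, len(tasks)))
    futures = {executor.submit(fn): name for name, fn in tasks.items()}
    try:
        for future in as_completed(futures, timeout=timeout_s):
            # Tasks are expected to catch their own exceptions.
            results[futures[future]] = future.result()
    except TimeoutError:
        for future, name in futures.items():
            if not future.done():
                future.cancel()  # no effect on tasks that are already running
                results[name] = {"status": "failed", "error_type": "timeout"}
    finally:
        # Return promptly; requires Python 3.9+ for cancel_futures.
        executor.shutdown(wait=False, cancel_futures=True)
    return results
```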

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af2a68e and a6eeb17.

📒 Files selected for processing (16)

src/d4bl/observability/langfuse.py
src/d4bl/services/langfuse/__init__.py
src/d4bl/services/langfuse/_base.py
src/d4bl/services/langfuse/bias.py
src/d4bl/services/langfuse/client.py
src/d4bl/services/langfuse/content_relevance.py
src/d4bl/services/langfuse/evals.py
src/d4bl/services/langfuse/hallucination.py
src/d4bl/services/langfuse/parsers.py
src/d4bl/services/langfuse/quality.py
src/d4bl/services/langfuse/reference.py
src/d4bl/services/langfuse/report_relevance.py
src/d4bl/services/langfuse/runner.py
src/d4bl/services/langfuse/source_relevance.py
src/d4bl/services/research_runner.py
src/d4bl/settings.py

💤 Files with no reviewable changes (1)

src/d4bl/services/langfuse/evals.py

- Add docstring to EvalStatus enum
- Fix EN DASH in keyword_relevance docstring, reorder early return
- Break long validation return lines in bias, hallucination, reference
- Move get_settings import to module level in client.py
- Add 120s timeout guard for parallel evaluation batch

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/d4bl/services/langfuse/parsers.py`:
- Around line 10-15: The scoring divides by len(query_words) while matches only
count tokens with len > 3; fix by creating a filtered token set (e.g.,
filtered_query_words = [w for w in query_words if len(w) > 3]) and use that for
both matching and as the denominator: compute matches over filtered_query_words
against text_lower, and return max(1.0, min(5.0, (matches /
len(filtered_query_words)) * 5)). If filtered_query_words is empty (query had
only short tokens), preserve current fallback behavior (return 3.0) to avoid
division-by-zero and keep neutral score.
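
Put together, the corrected scoring might look like the sketch below. The signature, the length filter, the 3.0 fallback, and the 1.0-5.0 clamp come from the comment above; whitespace tokenization and substring matching are assumptions about details the comment does not show.

```python
def keyword_relevance(query: str, text: str) -> float:
    """Score keyword overlap between query and text on a 1.0-5.0 scale."""
    # Only words longer than 3 characters count, for matching AND the denominator.
    significant_words = {w for w in query.lower().split() if len(w) > 3}
    if not significant_words:
        return 3.0  # neutral fallback; also avoids division by zero
    text_lower = text.lower()
    matches = sum(1 for w in significant_words if w in text_lower)
    return max(1.0, min(5.0, (matches / len(significant_words)) * 5))
```

With the shared denominator, a query like "housing policy" scored against "policy brief" lands at 2.5, since one of its two significant words matches.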

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between a6eeb17 and b2ab5a8.

📒 Files selected for processing (7)

src/d4bl/services/langfuse/_base.py
src/d4bl/services/langfuse/bias.py
src/d4bl/services/langfuse/client.py
src/d4bl/services/langfuse/hallucination.py
src/d4bl/services/langfuse/parsers.py
src/d4bl/services/langfuse/reference.py
src/d4bl/services/langfuse/runner.py

- Use significant_words set for both matching and denominator in keyword_relevance() to avoid score inflation from short words
- Replace ThreadPoolExecutor context manager with explicit shutdown(wait=False, cancel_futures=True) to prevent blocking when futures are still running after timeout

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/d4bl/services/langfuse/runner.py (1)

152-179: ⚠️ Potential issue | 🟠 Major

Avoid flushing Langfuse immediately after timed-out non-blocking shutdown.

When the timeout is hit (line 162), executor.shutdown(wait=False, cancel_futures=True) returns immediately without waiting for running worker threads to complete. Those workers may still call langfuse.score_current_trace() while line 179 calls flush(), creating a race condition that can lead to incomplete or corrupted telemetry data.

🔧 Proposed fix

-    try:
+    timed_out = False
+    try:
         for future in as_completed(futures, timeout=eval_timeout_s):
             name, result = future.result()
             results["evaluations"][name] = result
     except TimeoutError:
+        timed_out = True
         logger.error(
             "Evaluation batch timed out after %ss", eval_timeout_s,
         )
@@
     finally:
         executor.shutdown(wait=False, cancel_futures=True)

     # --- Flush Langfuse once at the end (finding 3.6) ---
-    if langfuse:
+    if langfuse and not timed_out:
         try:
             langfuse.flush()
         except Exception as flush_err:
             logger.warning("Langfuse flush failed: %s", flush_err)
+    elif timed_out:
+        eval_logger.warning(
+            "Skipping immediate Langfuse flush after timeout; worker threads may still be running."
+        )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/d4bl/services/langfuse/runner.py` around lines 152 - 179, The executor is
shut down with wait=False and cancel_futures=True which returns immediately
while worker threads may still call langfuse.score_current_trace(), causing a
race when langfuse.flush() runs; in the finally block change shutdown to wait
for running tasks to finish (e.g., executor.shutdown(wait=True,
cancel_futures=False) or call concurrent.futures.wait(futures.keys()) before
calling langfuse.flush()) so that _run_eval workers (and any calls to
langfuse.score_current_trace()) complete before invoking langfuse.flush().
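
The ordering this prompt asks for can be sketched with a stand-in telemetry object. `FakeTelemetry` and `run_then_flush` are hypothetical and only mimic the score-then-flush shape; they are not the Langfuse client API.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


class FakeTelemetry:
    """Stand-in for a telemetry client: records scores, then flushes them."""

    def __init__(self) -> None:
        self.pending: list[str] = []
        self.flushed: list[str] = []

    def score(self, name: str) -> None:
        self.pending.append(name)

    def flush(self) -> None:
        self.flushed.extend(self.pending)
        self.pending.clear()


def run_then_flush(telemetry: FakeTelemetry, names: list[str], delay_s: float = 0.05) -> None:
    def worker(name: str) -> None:
        time.sleep(delay_s)    # simulated LLM call
        telemetry.score(name)  # side effect that must land before flush()

    executor = ThreadPoolExecutor(max_workers=max(1, len(names)))
    futures = [executor.submit(worker, n) for n in names]
    try:
        wait(futures)          # block until every worker has finished
    finally:
        executor.shutdown(wait=True)
    telemetry.flush()          # safe: no worker can still be scoring
```

The trade-off is that waiting reintroduces the risk of a hung evaluator blocking the job, which is why the proposed diff above takes the other route and skips the flush after a timeout.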

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between b2ab5a8 and 1643770.

📒 Files selected for processing (2)

src/d4bl/services/langfuse/parsers.py
src/d4bl/services/langfuse/runner.py

coderabbitai Bot requested changes Mar 3, 2026

Comment thread src/d4bl/services/langfuse/parsers.py Outdated

Comment thread src/d4bl/services/langfuse/runner.py

coderabbitai Bot reviewed Mar 3, 2026

coderabbitai Bot approved these changes Mar 3, 2026

William-Hill merged commit 80072e5 into main Mar 3, 2026

William-Hill deleted the feature/simplify-services-phase3 branch March 3, 2026 15:30

William-Hill mentioned this pull request Mar 3, 2026

fix: Race condition between executor shutdown and Langfuse flush on timeout #33

Closed

coderabbitai Bot mentioned this pull request Mar 3, 2026

Simplify: app/ module (#21) #35

Merged

6 tasks

William-Hill mentioned this pull request Mar 4, 2026

Epic: Code Simplification Sweep #29

Closed

10 tasks

coderabbitai Bot mentioned this pull request Mar 5, 2026

refactor: Modernize type hints across codebase (#31) #45

Merged

4 tasks

William-Hill merged 3 commits into main from feature/simplify-services-phase3
