refactor: Simplify services/ module — dedup, enums, parallel evals#32

Merged
William-Hill merged 3 commits into main from feature/simplify-services-phase3
Mar 3, 2026
Conversation

@William-Hill

@William-Hill William-Hill commented Mar 3, 2026

Owner

Summary

Continues issue #19 (services/ module simplification) with Phases 3-5, completing 8 more findings (17/20 total).

Phase 3 — Safe/Quick wins:

  • 1.3: Extract duplicated _keyword_relevance to parsers.keyword_relevance() (was in 3 files)
  • 1.4: Verified JSON parse dedup is complete — no action needed
  • 2.7: Remove evals.py compatibility shim, update import in research_runner.py
  • 3.7: Centralize Docker detection via Settings.is_docker (replaces os.path.exists("/.dockerenv") in client.py + observability/langfuse.py)
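
A minimal sketch of what finding 3.7 might look like. The real Settings class in src/d4bl/settings.py is not shown in this PR, so the `dockerenv_path` field, the cached `get_settings()` body, the `adjust_host` helper, and the `host.docker.internal` replacement are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Settings:
    # Hypothetical field: path whose existence signals a Docker container.
    dockerenv_path: str = "/.dockerenv"

    @property
    def is_docker(self) -> bool:
        """Return True when the Docker marker file exists."""
        return os.path.exists(self.dockerenv_path)


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    """Return one shared Settings instance per process."""
    return Settings()


def adjust_host(host: str, settings: Settings) -> str:
    """Rewrite a localhost URL when running in Docker (illustrative helper)."""
    # Assumption: host.docker.internal is the desired target inside a container.
    if settings.is_docker and "localhost" in host:
        return host.replace("localhost", "host.docker.internal")
    return host
```

Callers then ask `get_settings().is_docker` instead of repeating the filesystem check in each module.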

Phase 4 — Low risk:

  • 2.3: Add EvalStatus(str, Enum) replacing stringly-typed status values across all evaluators and runner
  • 2.6: Extract notify_progress() helper deduplicating 3 identical set_status + send_websocket_update blocks (~30 LOC saved)
  • 1.1: Docker detection now shared; singletons kept separate (different architectural purposes)
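
The notify_progress() extraction from finding 2.6 can be sketched roughly as below. This is a simplified, synchronous illustration: the callback signatures, the stage names, and the message shape are hypothetical, and the real helper in research_runner.py may well be async.

```python
from typing import Callable


def run_job(
    job_id: str,
    set_status: Callable[[str, str], None],
    send_websocket_update: Callable[[str, dict], None],
) -> None:
    """Run a job, reporting each stage through a single nested helper."""

    def notify_progress(stage: str) -> None:
        # One place for the paired DB-status and websocket update
        # that was previously repeated at each stage.
        set_status(job_id, stage)
        send_websocket_update(job_id, {"type": "progress", "stage": stage})

    notify_progress("researching")
    notify_progress("evaluating")
    notify_progress("completed")
```

Keeping the helper nested lets it close over `job_id` and the two callbacks, so each call site shrinks to one line.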

Phase 5 — Parallel evaluations:

  • 3.1: Replace sequential evaluation loop with ThreadPoolExecutor — all evaluations (including optional content/report relevance) run concurrently

Intentionally deferred (3 findings):

  • 1.2: Retry reuse — call_llm_text uses linear delay vs retry_with_backoff's exponential; different semantics, low ROI
  • 2.5: Parameter sprawl in update_job_status — single caller, params mirror DB columns
  • 3.4: Parallel URL evals within content_relevance.py — marginal gain within single evaluator
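
To illustrate why finding 1.2 was deferred, here is a small sketch of the two delay schedules. It assumes "linear" means the delay grows by a fixed step per attempt; the function names and the base delay are hypothetical and do not reproduce call_llm_text or retry_with_backoff.

```python
def linear_delays(base: float, attempts: int) -> list[float]:
    """Delay grows by a fixed step: base, 2*base, 3*base, ..."""
    return [base * (i + 1) for i in range(attempts)]


def exponential_delays(base: float, attempts: int) -> list[float]:
    """Delay doubles on each attempt: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(attempts)]


print(linear_delays(1.0, 4))       # [1.0, 2.0, 3.0, 4.0]
print(exponential_delays(1.0, 4))  # [1.0, 2.0, 4.0, 8.0]
```

The schedules diverge quickly, so swapping one helper for the other would change how long a caller waits on a slow LLM.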

Test plan

  • All 52 tests pass after each phase
  • EvalStatus extends str so existing string comparisons remain backward-compatible
  • Manual verification of parallel evaluations with live Ollama (thread safety of LLM HTTP calls)
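
A minimal sketch of the backward-compatibility claim for EvalStatus. The member names come from this PR; the lowercase string values are an assumption, since the actual values in _base.py are not shown here.

```python
from enum import Enum


class EvalStatus(str, Enum):
    """Outcome of a single evaluation or of a whole evaluation batch."""

    SUCCESS = "success"                  # assumed value
    FAILED = "failed"                    # assumed value
    SKIPPED = "skipped"                  # assumed value
    PARTIAL_SUCCESS = "partial_success"  # assumed value


result = {"status": EvalStatus.SUCCESS}

# Members are also str instances, so older string comparisons still hold.
assert result["status"] == "success"
assert isinstance(result["status"], str)

# Raw strings read back from storage convert to members by value.
assert EvalStatus("failed") is EvalStatus.FAILED
```

Equality is preserved, but `str(EvalStatus.SUCCESS)` is not the raw value, so `.value` is the safer choice wherever a plain string is written out.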

Closes #19

Summary by CodeRabbit

  • New Features

    • Parallelized evaluations for faster, concurrent processing.
    • Runtime Docker detection via settings to auto-adjust behavior.
  • Improvements

    • Standardized evaluation statuses with a shared enum (SUCCESS, FAILED, SKIPPED, PARTIAL_SUCCESS).
    • Centralized keyword relevance scoring via a shared parser utility.
  • Changes

    • Consolidated progress notifications for clearer job-stage reporting.

Phase 3-5 of issue #19 code simplification:

- Extract duplicated _keyword_relevance to parsers.keyword_relevance()
- Remove evals.py compatibility shim (update import in research_runner)
- Centralize Docker detection via Settings.is_docker
- Add EvalStatus(str, Enum) replacing stringly-typed status values
- Parallelize evaluations with ThreadPoolExecutor (finding 3.1)
- Extract notify_progress() helper deduplicating 3 identical blocks
- Verify JSON parse dedup is complete (finding 1.4 — no action needed)

Intentionally deferred: 1.2 (retry reuse — different semantics),
2.5 (param sprawl — single caller), 3.4 (parallel URL evals — marginal)

17/20 findings now addressed for services/ module.
@coderabbitai

coderabbitai Bot commented Mar 3, 2026

Contributor
Walkthrough

Centralizes Docker detection in Settings.is_docker, introduces an EvalStatus enum and replaces string statuses across langfuse evaluators, adds a keyword_relevance parser, parallelizes evaluations in the runner with ThreadPoolExecutor, removes a redundant evals shim, and centralizes progress notifications in research_runner.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Settings & Observability: src/d4bl/settings.py, src/d4bl/observability/langfuse.py | Adds Settings.is_docker and switches Docker checks to use it (centralizes environment detection). |
| Langfuse Client: src/d4bl/services/langfuse/client.py | Replaces filesystem Docker check (/.dockerenv) with get_settings().is_docker when adjusting host for Docker environments. |
| EvalStatus Enum & Export: src/d4bl/services/langfuse/_base.py, src/d4bl/services/langfuse/__init__.py | Adds EvalStatus (str Enum: SUCCESS, FAILED, SKIPPED, PARTIAL_SUCCESS) and exports it from the package. |
| Evaluator Status Standardization: src/d4bl/services/langfuse/quality.py, .../bias.py, .../hallucination.py, .../reference.py, .../content_relevance.py, .../source_relevance.py, .../report_relevance.py | Replaces string status literals with EvalStatus enum values in return payloads and error paths across evaluators. |
| Parsers: keyword_relevance: src/d4bl/services/langfuse/parsers.py | Adds keyword_relevance(query, text) -> float to consolidate keyword-overlap scoring; it is used as a fallback in relevance evaluators. |
| Runner: Parallelization & Status Aggregation: src/d4bl/services/langfuse/runner.py | Refactors sequential evaluator execution into data-driven parallel execution using ThreadPoolExecutor, adds _run_eval, unified eval_specs, timeout handling, and aggregates overall status using EvalStatus. |
| Compatibility shim removal: src/d4bl/services/langfuse/evals.py | Removes the re-export shim file that duplicated the public API (deleted). |
| Research runner progress updates: src/d4bl/services/research_runner.py | Introduces nested notify_progress helper to consolidate DB status + websocket updates and updates import to call runner's comprehensive evaluation. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Runner as runner.py
    participant Executor as ThreadPoolExecutor
    participant Eval as Evaluator
    participant Langfuse as LangfuseClient

    Client->>Runner: run_comprehensive_evaluation(...)
    Runner->>Runner: build eval_specs (name, func, kwargs)
    Runner->>Executor: submit(_run_eval) for each spec
    par Parallel evaluations
        Executor->>Eval: execute evaluator function
        Eval->>Langfuse: call LLM / Langfuse client
        Langfuse-->>Eval: return evaluation result
        Eval-->>Executor: return result dict (with EvalStatus)
    end
    Executor->>Runner: collect results / handle exceptions / timeouts
    Runner->>Runner: aggregate EvalStatus -> overall status
    Runner-->>Client: return aggregated results with EvalStatus

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • Epic: Code Simplification Sweep #29: Implements centralized Settings.is_docker and replaces string status values with an EvalStatus enum, addressing the issue's Docker-detection and status-typing objectives.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 35.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: deduplication, enum introduction, and parallelization of evaluations. |
| Linked Issues check | ✅ Passed | All primary objectives from issue #19 are addressed: deduplication via keyword_relevance extraction, enum introduction via EvalStatus, parallel evaluations via ThreadPoolExecutor, Docker detection centralization, and evals.py removal. |
| Out of Scope Changes check | ✅ Passed | All changes align with issue #19 objectives; no out-of-scope modifications detected beyond the intended code simplification and refactoring scope. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/d4bl/services/langfuse/_base.py`:
- Around line 12-17: Add a concise docstring to the public EvalStatus enum
describing its purpose and the meaning of each member (SUCCESS, FAILED, SKIPPED,
PARTIAL_SUCCESS) so users of the API understand what the status values
represent; modify the EvalStatus class definition to include this explanatory
string literal immediately under the class declaration.

In `@src/d4bl/services/langfuse/bias.py`:
- Around line 17-20: The return statements for validation of research_output and
query are too long; refactor the long dict literal returned when research_output
or query is empty (in the validation block referencing variables
research_output, query and the EvalStatus enum) so it fits under 100 chars per
line — e.g., build the response dict in multiple lines or assign it to a local
variable (e.g., response = {...} with each key on its own line) and then return
that variable for both checks to keep lines short and avoid exceeding the
100-character limit.

In `@src/d4bl/services/langfuse/client.py`:
- Around line 39-40: Move the inline import of get_settings out of the function
and place it at module level: remove the local "from d4bl.settings import
get_settings" inside the code that checks "if get_settings().is_docker and
'localhost' in langfuse_host" and instead import get_settings once at the top of
src/d4bl/services/langfuse/client.py; update any references to get_settings() in
that module to use the module-level import so the check using
get_settings().is_docker and langfuse_host remains unchanged but avoids repeated
inline imports.

In `@src/d4bl/services/langfuse/hallucination.py`:
- Around line 24-29: The three validation return statements for query, answer,
and context in hallucination.py exceed the 100-char line limit; split each
return dict across multiple lines so keys and values are on their own lines
(e.g., break the return {"error": ..., "status": EvalStatus.FAILED,
"error_type": "validation"} into a multi-line dict) for the checks that use if
not query or not query.strip(), if not answer or not answer.strip(), and if not
context or not context.strip(), keeping EvalStatus.FAILED and "error_type":
"validation" intact.

In `@src/d4bl/services/langfuse/parsers.py`:
- Around line 8-15: Update keyword_relevance: replace the EN DASH in the
docstring with a normal hyphen-minus and move the empty-query check before
computing matches to avoid confusing flow and potential division by zero;
specifically, in function keyword_relevance ensure you return a default score
(e.g., 3.0) if query_words is empty before computing matches, then compute
matches and the final scaled score using query_words and text_lower as currently
done.

In `@src/d4bl/services/langfuse/reference.py`:
- Around line 25-30: The validation return statements for the
query/answer/context checks exceed the 100-character line limit; refactor each
conditional (the checks that use EvalStatus and return the error dict) to break
the returned dict across multiple lines or assign the error message and the dict
to short-named variables before returning so no single line exceeds 100 chars
(e.g., split the dict keys/value pairs onto separate lines or use a small helper
like _validation_error(status, error_type, message) and call it from the three
places). Ensure you still return the same keys ("error", "status", "error_type")
and use EvalStatus.FAILED.

In `@src/d4bl/services/langfuse/runner.py`:
- Around line 152-160: The parallel evaluation loop currently waits indefinitely
for futures; wrap as_completed or future.result with a timeout and handle
TimeoutError: use as_completed(futures, timeout=...) or call
future.result(timeout=...) inside the loop, catch
concurrent.futures.TimeoutError, then cancel outstanding futures (call
future.cancel() for any unfinished futures in the futures dict), record a
timeout failure into results["evaluations"][name] (or a generic timeout entry)
and log the timeout; update the block that creates the ThreadPoolExecutor,
futures, and the loop over as_completed so that _run_eval calls are bounded and
hung evaluators don't block the whole job.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af2a68e and a6eeb17.

📒 Files selected for processing (16)
  • src/d4bl/observability/langfuse.py
  • src/d4bl/services/langfuse/__init__.py
  • src/d4bl/services/langfuse/_base.py
  • src/d4bl/services/langfuse/bias.py
  • src/d4bl/services/langfuse/client.py
  • src/d4bl/services/langfuse/content_relevance.py
  • src/d4bl/services/langfuse/evals.py
  • src/d4bl/services/langfuse/hallucination.py
  • src/d4bl/services/langfuse/parsers.py
  • src/d4bl/services/langfuse/quality.py
  • src/d4bl/services/langfuse/reference.py
  • src/d4bl/services/langfuse/report_relevance.py
  • src/d4bl/services/langfuse/runner.py
  • src/d4bl/services/langfuse/source_relevance.py
  • src/d4bl/services/research_runner.py
  • src/d4bl/settings.py
💤 Files with no reviewable changes (1)
  • src/d4bl/services/langfuse/evals.py

Comment thread src/d4bl/services/langfuse/_base.py
Comment thread src/d4bl/services/langfuse/bias.py Outdated
Comment thread src/d4bl/services/langfuse/client.py Outdated
Comment thread src/d4bl/services/langfuse/hallucination.py Outdated
Comment thread src/d4bl/services/langfuse/parsers.py Outdated
Comment thread src/d4bl/services/langfuse/reference.py Outdated
Comment thread src/d4bl/services/langfuse/runner.py Outdated
- Add docstring to EvalStatus enum
- Fix EN DASH in keyword_relevance docstring, reorder early return
- Break long validation return lines in bias, hallucination, reference
- Move get_settings import to module level in client.py
- Add 120s timeout guard for parallel evaluation batch

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/d4bl/services/langfuse/parsers.py`:
- Around line 10-15: The scoring divides by len(query_words) while matches only
count tokens with len > 3; fix by creating a filtered token set (e.g.,
filtered_query_words = [w for w in query_words if len(w) > 3]) and use that for
both matching and as the denominator: compute matches over filtered_query_words
against text_lower, and return max(1.0, min(5.0, (matches /
len(filtered_query_words)) * 5)). If filtered_query_words is empty (query had
only short tokens), preserve current fallback behavior (return 3.0) to avoid
division-by-zero and keep neutral score.
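
Following the description above, the corrected scoring could look roughly like this. It is a sketch reconstructed from the review comment, not the code in parsers.py; whitespace tokenization and substring matching are assumptions.

```python
def keyword_relevance(query: str, text: str) -> float:
    """Score keyword overlap between query and text on a 1.0-5.0 scale."""
    text_lower = text.lower()
    # Only words longer than 3 characters count, both for matching
    # and for the denominator.
    significant_words = {w for w in query.lower().split() if len(w) > 3}
    if not significant_words:
        # Neutral fallback when the query has no significant words.
        return 3.0
    matches = sum(1 for w in significant_words if w in text_lower)
    return max(1.0, min(5.0, (matches / len(significant_words)) * 5))


print(keyword_relevance("housing policy the", "Housing costs rose"))  # 2.5
```

With the filtered set as the denominator, the short word "the" no longer affects the score: one of two significant words matches, giving 2.5.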

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6eeb17 and b2ab5a8.

📒 Files selected for processing (7)
  • src/d4bl/services/langfuse/_base.py
  • src/d4bl/services/langfuse/bias.py
  • src/d4bl/services/langfuse/client.py
  • src/d4bl/services/langfuse/hallucination.py
  • src/d4bl/services/langfuse/parsers.py
  • src/d4bl/services/langfuse/reference.py
  • src/d4bl/services/langfuse/runner.py

Comment thread src/d4bl/services/langfuse/parsers.py Outdated
Comment thread src/d4bl/services/langfuse/runner.py
- Use significant_words set for both matching and denominator in
  keyword_relevance() to avoid score inflation from short words
- Replace ThreadPoolExecutor context manager with explicit
  shutdown(wait=False, cancel_futures=True) to prevent blocking
  when futures are still running after timeout

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/d4bl/services/langfuse/runner.py (1)

152-179: ⚠️ Potential issue | 🟠 Major

Avoid flushing Langfuse immediately after timed-out non-blocking shutdown.

When timeout is hit (Line 162), executor.shutdown(wait=False, cancel_futures=True) returns immediately without waiting for running worker threads to complete. Those workers may still execute langfuse.score_current_trace() while Line 179 calls flush(), creating a race condition that can lead to incomplete or corrupted telemetry data.

🔧 Proposed fix
-    try:
+    timed_out = False
+    try:
         for future in as_completed(futures, timeout=eval_timeout_s):
             name, result = future.result()
             results["evaluations"][name] = result
     except TimeoutError:
+        timed_out = True
         logger.error(
             "Evaluation batch timed out after %ss", eval_timeout_s,
         )
@@
     finally:
         executor.shutdown(wait=False, cancel_futures=True)

     # --- Flush Langfuse once at the end (finding 3.6) ---
-    if langfuse:
+    if langfuse and not timed_out:
         try:
             langfuse.flush()
         except Exception as flush_err:
             logger.warning("Langfuse flush failed: %s", flush_err)
+    elif timed_out:
+        eval_logger.warning(
+            "Skipping immediate Langfuse flush after timeout; worker threads may still be running."
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/d4bl/services/langfuse/runner.py` around lines 152 - 179, The executor is
shut down with wait=False and cancel_futures=True which returns immediately
while worker threads may still call langfuse.score_current_trace(), causing a
race when langfuse.flush() runs; in the finally block change shutdown to wait
for running tasks to finish (e.g., executor.shutdown(wait=True,
cancel_futures=False) or call concurrent.futures.wait(futures.keys()) before
calling langfuse.flush()) so that _run_eval workers (and any calls to
langfuse.score_current_trace()) complete before invoking langfuse.flush().
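
The ordering this comment asks for (workers finish, then flush) can be sketched as follows. `FakeTelemetry` is a hypothetical stand-in, not the Langfuse client API, and `run_then_flush` is an illustrative name.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait


class FakeTelemetry:
    """Hypothetical stand-in for a telemetry client with score and flush."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.pending: list[str] = []
        self.flushed: list[str] = []

    def score(self, name: str) -> None:
        with self._lock:
            self.pending.append(name)

    def flush(self) -> None:
        with self._lock:
            self.flushed.extend(self.pending)
            self.pending.clear()


def run_then_flush(names, client, delay_s=0.05):
    """Run one worker per name, wait for all of them, then flush once."""

    def _run_eval(name: str) -> str:
        time.sleep(delay_s)  # simulate evaluator work
        client.score(name)
        return name

    executor = ThreadPoolExecutor(max_workers=max(1, len(names)))
    futures = [executor.submit(_run_eval, n) for n in names]
    try:
        wait(futures)  # block until every worker has finished scoring
    finally:
        executor.shutdown(wait=True)
    client.flush()  # no worker can still be scoring at this point
```

Waiting restores the ordering guarantee, but it also reintroduces blocking on a hung evaluator, which is the trade-off the earlier shutdown(wait=False, cancel_futures=True) change was meant to avoid.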

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2ab5a8 and 1643770.

📒 Files selected for processing (2)
  • src/d4bl/services/langfuse/parsers.py
  • src/d4bl/services/langfuse/runner.py

@William-Hill William-Hill merged commit 80072e5 into main Mar 3, 2026
@William-Hill William-Hill deleted the feature/simplify-services-phase3 branch March 3, 2026 15:30
@coderabbitai coderabbitai Bot mentioned this pull request Mar 3, 2026
6 tasks
@William-Hill William-Hill mentioned this pull request Mar 4, 2026
10 tasks
Development

Successfully merging this pull request may close these issues.

Simplify: services/ module

1 participant