Skip to content

refactor: Modernize type hints across codebase (#31)#45

Merged
William-Hill merged 4 commits into
mainfrom
refactor/modernize-type-hints
Mar 5, 2026
Merged

refactor: Modernize type hints across codebase (#31)#45
William-Hill merged 4 commits into
mainfrom
refactor/modernize-type-hints

Conversation

@William-Hill

@William-Hill William-Hill commented Mar 5, 2026

Copy link
Copy Markdown
Owner

Summary

  • Replace deprecated typing module imports with modern Python 3.10+ built-in equivalents across 26 files
  • Dictdict, Listlist, Optional[X]X | None, Union[X, Y]X | Y
  • Callablecollections.abc.Callable, AsyncGeneratorcollections.abc.AsyncGenerator
  • Add from __future__ import annotations to 7 files that were missing it
  • Only Any, TypeVar, Literal, and Type remain in typing imports (not deprecated)

Closes #31

Test plan

  • All 121 existing tests pass
  • All module imports verified at runtime
  • Zero Optional[, Dict[, List[, Union[ annotations remaining
  • Zero deprecated typing imports remaining

Summary by CodeRabbit

  • Refactor

    • Modernized type annotations across the codebase (built-in generics, PEP 604 unions, postponed evaluation); updated many public signatures for consistency. No behavioral changes.
  • New Features

    • Implemented client-side PDF extraction to download and extract text and metadata reliably.
  • Chores

    • Added public helpers for websocket and job log management; improved JSON parsing robustness and minor typing/import cleanups.

Replace deprecated typing module imports with modern Python 3.10+ equivalents:
- Dict → dict, List → list, Optional[X] → X | None, Union[X, Y] → X | Y
- Callable → collections.abc.Callable, AsyncGenerator → collections.abc.AsyncGenerator
- Add `from __future__ import annotations` where missing
- 26 files updated, zero behavior change, all 121 tests pass
@coderabbitai

coderabbitai Bot commented Mar 5, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: cddacf1f-6546-4794-9f52-ee9d4f1e064c

📥 Commits

Reviewing files that changed from the base of the PR and between 46b7e27 and a7d7073.

📒 Files selected for processing (1)
  • src/d4bl/infra/vector_store.py

📝 Walkthrough

Walkthrough

Modernizes type hints across the codebase by replacing typing module aliases with Python 3.10+ built-ins (e.g., Listlist, Dictdict, Optional[X]X | None, Union|) and enables postponed evaluation via from __future__ import annotations. No runtime logic changes.

Changes

Cohort / File(s) Summary
Agent Management
src/d4bl/agents/crew.py
Updated class/instance attribute annotations: List[...]list[...], Optional[List[str]] → `list[str]
Crawl Tools
src/d4bl/agents/tools/crawl_tools/crawl4ai.py, src/d4bl/agents/tools/crawl_tools/firecrawl.py, src/d4bl/agents/tools/crawl_tools/pdf_extraction.py, src/d4bl/agents/tools/crawl_tools/utils.py
Replaced List/Optional/Union with built-in generics and `
API & Schemas
src/d4bl/app/api.py, src/d4bl/app/schemas.py
Converted endpoint parameters, response models, and Pydantic fields to list[...] and `X
WebSocket Manager
src/d4bl/app/websocket_manager.py
Switched Dict/Listdict/list, updated optional return types to `X
Vector Store
src/d4bl/infra/vector_store.py
Replaced Optional and typing collections with built-in generics; updated constructor and public method signatures (embeddings, metadata, store/search APIs) to list[...], dict[...], and `X
Query Components
src/d4bl/query/engine.py, src/d4bl/query/fusion.py, src/d4bl/query/parser.py, src/d4bl/query/structured.py
Added future annotations and changed Optional[...] to `X
Error Handling
src/d4bl/services/error_handling.py
Moved Callable import to collections.abc, converted optional callables and generics to `
Langfuse Services
src/d4bl/services/langfuse/...
/_base.py, /bias.py, /client.py, /content_relevance.py, /hallucination.py, /parsers.py, /quality.py, /reference.py, /report_relevance.py, /runner.py, /source_relevance.py
Comprehensive typing modernization: Dictdict, Listlist, Optional→`
Research Runner
src/d4bl/services/research_runner.py
Updated function and nested helper signatures to use `X

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Out of Scope Changes check ⚠️ Warning One out-of-scope change detected: a functional fix to vector_store.py changing the limit parameter check from truthy to explicit None check (limit=0 now honored), which is a behavior change beyond the mechanical refactor scope. Extract the limit parameter fix to a separate issue/PR to keep this PR purely mechanical. Document the behavioral change separately for review and testing.
Docstring Coverage ⚠️ Warning Docstring coverage is 70.42% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the main change: modernizing type hints across the codebase using Python 3.10+ syntax.
Linked Issues check ✅ Passed All changes align with issue #31 requirements: Dict→dict, List→list, Optional[X]→X|None, Union→|, Callable→collections.abc.Callable conversions implemented across 26 files with no behavioral changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch refactor/modernize-type-hints

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
src/d4bl/agents/tools/crawl_tools/pdf_extraction.py (3)

155-156: ⚠️ Potential issue | 🟠 Major

Escape extracted text before embedding it into HTML fields.

Raw PDF text can contain markup-like content; embedding it directly into cleaned_html/html can propagate executable HTML/JS downstream.

Proposed fix
+import html
@@
-            return {
+            preview = html.escape(extracted_text[:1000], quote=True)
+            return {
                 "url": url,
                 "extracted_content": extracted_text,
-                "cleaned_html": f"<html><body><pre>{extracted_text[:1000]}...</pre></body></html>",
-                "html": f"<html><body><pre>{extracted_text[:1000]}...</pre></body></html>",
+                "cleaned_html": f"<html><body><pre>{preview}...</pre></body></html>",
+                "html": f"<html><body><pre>{preview}...</pre></body></html>",
                 "metadata": metadata,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/d4bl/agents/tools/crawl_tools/pdf_extraction.py` around lines 155 - 156,
Sanitize extracted_text before embedding into the cleaned_html and html fields
to prevent injected HTML/JS: in the code that builds the dict entries
"cleaned_html" and "html" (where they currently use
f"<html><body><pre>{extracted_text[:1000]}...</pre></body></html>"), replace raw
insertion with an escaped version of extracted_text (e.g., call a utility like
html.escape or an existing escape_html function on extracted_text[:1000]) and
then interpolate the escaped string so the pre block cannot render markup or
scripts.

89-89: ⚠️ Potential issue | 🟡 Minor

Split the User-Agent string to satisfy max line length.

Line 89 appears to exceed the 100-character Python line-length rule.

Proposed fix
-            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
+            "User-Agent": (
+                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
+                "AppleWebKit/537.36 (KHTML, like Gecko) "
+                "Chrome/91.0.4472.124 Safari/537.36"
+            ),

As per coding guidelines: **/*.py must enforce a 100 character maximum line length in Python code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/d4bl/agents/tools/crawl_tools/pdf_extraction.py` at line 89, The
'User-Agent' header string in the headers dict (in
src/d4bl/agents/tools/crawl_tools/pdf_extraction.py) exceeds the 100-char
line-length rule; break it into multiple shorter string literals (e.g., use
implicit adjacent string literals or concatenate with + inside the same
expression) for the 'User-Agent' value so the line length is under 100 chars
while preserving the exact header value (reference the headers dict and the
'User-Agent' key to locate the change).

94-112: ⚠️ Potential issue | 🟠 Major

Enforce maximum PDF download size and use chunked streaming.

stream=True is negated by response.content, which buffers the entire response in memory. Without a size limit, this enables resource-exhaustion attacks (e.g., downloading multi-GB files). Additionally, the response object is not managed as a context manager, risking unclosed connections.

Implement chunked reading with response.iter_content() and enforce a maximum download size before writing to disk.

Proposed fix
-        response = requests.get(
-            url, 
-            timeout=timeout * 2,  # PDFs may take longer
-            stream=True,
-            headers=headers,
-            allow_redirects=True
-        )
-        response.raise_for_status()
-        
-        # Check content type
-        content_type = response.headers.get('Content-Type', '').lower()
-        if 'pdf' not in content_type and not url.lower().endswith('.pdf'):
-            logger.warning("URL does not appear to be a PDF (Content-Type: %s)", content_type)
-            # Continue anyway, might still be a PDF
-        
-        # Save to temporary file
-        with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp_file:
-            tmp_file.write(response.content)
-            tmp_path = tmp_file.name
+        max_pdf_bytes = 25 * 1024 * 1024  # 25MB safety cap
+        downloaded = 0
+        with requests.get(
+            url,
+            timeout=timeout * 2,  # PDFs may take longer
+            stream=True,
+            headers=headers,
+            allow_redirects=True,
+        ) as response:
+            response.raise_for_status()
+
+            content_type = response.headers.get('Content-Type', '').lower()
+            if 'pdf' not in content_type and not url.lower().endswith('.pdf'):
+                logger.warning("URL does not appear to be a PDF (Content-Type: %s)", content_type)
+
+            with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp_file:
+                for chunk in response.iter_content(chunk_size=8192):
+                    if not chunk:
+                        continue
+                    downloaded += len(chunk)
+                    if downloaded > max_pdf_bytes:
+                        raise ValueError(f"PDF exceeds max allowed size: {max_pdf_bytes} bytes")
+                    tmp_file.write(chunk)
+                tmp_path = tmp_file.name
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/d4bl/agents/tools/crawl_tools/pdf_extraction.py` around lines 94 - 112,
Replace the current full-buffer download (use of response.content) with chunked
streaming under a context manager: open the response with "with
requests.get(..., stream=True) as response:", call response.raise_for_status(),
then write response.iter_content(chunk_size=8192) into the
tempfile.NamedTemporaryFile while tracking a running byte counter and aborting
(closing response, deleting partial tmp file and raising/logging) if the total
exceeds a configured max_download_bytes (e.g., MAX_PDF_BYTES). Ensure you still
check Content-Type and URL suffix as before, set tmp_path from tmp_file.name
after successful completion, and use response.iter_content to avoid buffering
the whole response and to properly close the connection.
src/d4bl/services/langfuse/hallucination.py (1)

16-24: 🛠️ Refactor suggestion | 🟠 Major

Add a docstring for evaluate_hallucination.

This public API is changed in this PR but still has no function-level docstring.

📝 Proposed fix
 def evaluate_hallucination(
     query: str,
     answer: str,
     context: str,
     trace_id: str | None = None,
     llm: Any = None,
     langfuse: Any = None,
 ) -> dict[str, Any]:
+    """Evaluate whether the answer is grounded in the provided context."""

As per coding guidelines **/*.py: Include docstrings for public APIs in Python.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/d4bl/services/langfuse/hallucination.py` around lines 16 - 24, Add a
clear docstring to the public function evaluate_hallucination describing its
purpose (evaluate whether an LLM answer hallucinates given a query and context),
list and type each parameter (query: str, answer: str, context: str, trace_id:
Optional[str], llm: Any, langfuse: Any), explain the returned value shape
(dict[str, Any] and what keys/values represent), and note side effects (e.g.,
may emit logs/events to langfuse or use the llm client) and any exceptions
raised; place the docstring immediately inside the evaluate_hallucination
function definition so tools and linters recognize it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/d4bl/agents/tools/crawl_tools/crawl4ai.py`:
- Line 223: The method _handle_pdfs_separately(pdf_urls: list[str], query: str)
declares an unused parameter query which should be removed to fix ARG002; change
the signature to _handle_pdfs_separately(pdf_urls: list[str]) and update the
return type if necessary, then find and update all call sites that pass a query
(e.g., any usages inside this module or tests) to call
_handle_pdfs_separately(pdf_urls) instead, and adjust any references or type
hints to match the new signature.
- Line 11: Remove the deprecated typing.Type import and update the annotation
that uses it: delete the "from typing import Type" import and change the
args_schema annotation (the one currently typed with Type[...] in this module,
e.g., the args_schema variable or parameter at line ~67) to use the built-in
generic type syntax "type[...]" instead; ensure any other occurrences in
crawl4ai.py are similarly replaced and run type checks to confirm no remaining
references to typing.Type remain.

In `@src/d4bl/app/schemas.py`:
- Line 14: The line declaring selected_agents currently exceeds the 100-char
limit; refactor the declaration in schemas.py by moving the inline comment to
its own preceding or trailing line and/or shortening the comment so the
declaration line stays under 100 chars; specifically modify the selected_agents:
list[str] | None = None declaration and its comment (e.g., split the comment
"List of agent names to run (e.g., [\"researcher\", \"writer\"])" onto a
separate line or shorten it) so the combined lines conform to the 100-character
max while preserving the same explanatory text.

In `@src/d4bl/infra/vector_store.py`:
- Around line 211-214: engine.py currently passes job_id: str | None into
search_similar which expects UUID | None; fix by converting the string to a UUID
before calling vector_store.search_similar. Locate where job_id is
declared/received in src/d4bl/query/engine.py (the variable named job_id) and
call the existing parse_job_uuid(job_id) (same helper used in api.py) to produce
a UUID | None, then pass that result into search_similar; alternatively, if you
prefer the other approach, update the search_similar signature in
vector_store.py to accept job_id: str | UUID | None and adjust internal handling
accordingly so type contracts align.

In `@src/d4bl/query/fusion.py`:
- Line 52: Update the constructor signature to include an explicit return
annotation "-> None" for the __init__ method in fusion.py: locate the def
__init__(self, ollama_base_url: str | None = None) definition and change it to
declare "-> None" so the constructor is properly annotated per PEP 8/type-hint
guidelines.

In `@src/d4bl/query/parser.py`:
- Line 45: The __init__ constructor in the Parser class (def __init__(self,
ollama_base_url: str | None = None)) is missing an explicit return type; update
the signature of __init__ to include the return annotation "-> None" (i.e., def
__init__(self, ollama_base_url: str | None = None) -> None:) so it complies with
the project's typing/coding guidelines.

In `@src/d4bl/services/langfuse/content_relevance.py`:
- Around line 18-25: Add a clear docstring to the public function
evaluate_content_relevance describing its purpose, parameters, and return value:
state that it evaluates relevance of extracted_contents (list[dict]) against a
query (str), explain optional parameters trace_id (str|None) for tracing, llm
and langfuse (Any) for optional integrations, and describe the returned
dict[str, Any] shape (e.g., relevance scores, metadata, timing). Keep it concise
and include behavior/side-effects (uses time measurement, may call llm/langfuse)
and any exceptions or edge cases.

In `@src/d4bl/services/langfuse/parsers.py`:
- Line 76: The function signature for parse_label_score is longer than 100
chars; break it across multiple lines so it conforms to the 100-char limit — for
example, put the parameters (text: str, mapping: dict[str, float],
default_score: float = 3.0) each on their own line and/or move the return
annotation tuple[float, str] to a new line; update the signature in the
parse_label_score definition accordingly so the function name and all type
annotations remain unchanged.

In `@src/d4bl/services/langfuse/runner.py`:
- Around line 60-67: Add a clear docstring to the public function
run_comprehensive_evaluation that briefly describes its purpose and behavior,
documents each parameter (query, research_output, sources, trace_id,
extracted_contents, report) with expected types and meanings, specifies the
return type (dict[str, Any]) and the structure of the returned data, and notes
any exceptions or side-effects (e.g., logging, external calls) so callers know
how to use it; place the docstring immediately below the def
run_comprehensive_evaluation(...) line in standard triple-quoted format.

In `@src/d4bl/services/research_runner.py`:
- Line 144: The parameter annotation for logs is too broad; find the function
signatures containing "logs: list | None = None" (the two occurrences in this
file, one at the shown signature and the other later) and tighten them to "logs:
list[str] | None = None" so the pipeline clearly documents that log entries are
strings; update both occurrences, run type checks, and adjust any related
docstrings or return annotations if they assume a specific list element type.

---

Outside diff comments:
In `@src/d4bl/agents/tools/crawl_tools/pdf_extraction.py`:
- Around line 155-156: Sanitize extracted_text before embedding into the
cleaned_html and html fields to prevent injected HTML/JS: in the code that
builds the dict entries "cleaned_html" and "html" (where they currently use
f"<html><body><pre>{extracted_text[:1000]}...</pre></body></html>"), replace raw
insertion with an escaped version of extracted_text (e.g., call a utility like
html.escape or an existing escape_html function on extracted_text[:1000]) and
then interpolate the escaped string so the pre block cannot render markup or
scripts.
- Line 89: The 'User-Agent' header string in the headers dict (in
src/d4bl/agents/tools/crawl_tools/pdf_extraction.py) exceeds the 100-char
line-length rule; break it into multiple shorter string literals (e.g., use
implicit adjacent string literals or concatenate with + inside the same
expression) for the 'User-Agent' value so the line length is under 100 chars
while preserving the exact header value (reference the headers dict and the
'User-Agent' key to locate the change).
- Around line 94-112: Replace the current full-buffer download (use of
response.content) with chunked streaming under a context manager: open the
response with "with requests.get(..., stream=True) as response:", call
response.raise_for_status(), then write response.iter_content(chunk_size=8192)
into the tempfile.NamedTemporaryFile while tracking a running byte counter and
aborting (closing response, deleting partial tmp file and raising/logging) if
the total exceeds a configured max_download_bytes (e.g., MAX_PDF_BYTES). Ensure
you still check Content-Type and URL suffix as before, set tmp_path from
tmp_file.name after successful completion, and use response.iter_content to
avoid buffering the whole response and to properly close the connection.

In `@src/d4bl/services/langfuse/hallucination.py`:
- Around line 16-24: Add a clear docstring to the public function
evaluate_hallucination describing its purpose (evaluate whether an LLM answer
hallucinates given a query and context), list and type each parameter (query:
str, answer: str, context: str, trace_id: Optional[str], llm: Any, langfuse:
Any), explain the returned value shape (dict[str, Any] and what keys/values
represent), and note side effects (e.g., may emit logs/events to langfuse or use
the llm client) and any exceptions raised; place the docstring immediately
inside the evaluate_hallucination function definition so tools and linters
recognize it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 20b7c021-8e01-48b3-9e43-0b14a5a10127

📥 Commits

Reviewing files that changed from the base of the PR and between 41dba5c and 85426f6.

📒 Files selected for processing (26)
  • src/d4bl/agents/crew.py
  • src/d4bl/agents/tools/crawl_tools/crawl4ai.py
  • src/d4bl/agents/tools/crawl_tools/firecrawl.py
  • src/d4bl/agents/tools/crawl_tools/pdf_extraction.py
  • src/d4bl/agents/tools/crawl_tools/utils.py
  • src/d4bl/app/api.py
  • src/d4bl/app/schemas.py
  • src/d4bl/app/websocket_manager.py
  • src/d4bl/infra/vector_store.py
  • src/d4bl/query/engine.py
  • src/d4bl/query/fusion.py
  • src/d4bl/query/parser.py
  • src/d4bl/query/structured.py
  • src/d4bl/services/error_handling.py
  • src/d4bl/services/langfuse/_base.py
  • src/d4bl/services/langfuse/bias.py
  • src/d4bl/services/langfuse/client.py
  • src/d4bl/services/langfuse/content_relevance.py
  • src/d4bl/services/langfuse/hallucination.py
  • src/d4bl/services/langfuse/parsers.py
  • src/d4bl/services/langfuse/quality.py
  • src/d4bl/services/langfuse/reference.py
  • src/d4bl/services/langfuse/report_relevance.py
  • src/d4bl/services/langfuse/runner.py
  • src/d4bl/services/langfuse/source_relevance.py
  • src/d4bl/services/research_runner.py

Comment thread src/d4bl/agents/tools/crawl_tools/crawl4ai.py Outdated
Comment thread src/d4bl/agents/tools/crawl_tools/crawl4ai.py Outdated
Comment thread src/d4bl/app/schemas.py Outdated
Comment thread src/d4bl/infra/vector_store.py Outdated
Comment thread src/d4bl/query/fusion.py Outdated
Comment thread src/d4bl/query/parser.py Outdated
Comment thread src/d4bl/services/langfuse/content_relevance.py
Comment thread src/d4bl/services/langfuse/parsers.py Outdated
Comment thread src/d4bl/services/langfuse/runner.py
Comment thread src/d4bl/services/research_runner.py Outdated
- Replace `typing.Type` with builtin `type[]` in crawl4ai.py
- Remove unused `query` param from `_handle_pdfs_separately`
- Move inline comment to own line in schemas.py (100-char limit)
- Widen `search_similar` job_id to `UUID | str | None` for engine.py compat
- Add `-> None` return annotations to `__init__` in fusion.py and parser.py
- Split long function signature in parsers.py
- Tighten `logs: list | None` to `list[str] | None` in research_runner.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/d4bl/services/langfuse/parsers.py (1)

18-25: ⚠️ Potential issue | 🟠 Major

Do not silently collapse JSON parse failures into normal fallback output.

Line [24]-[25] returns {} for JSONDecodeError, which is indistinguishable from “no JSON found.”
Downstream callers then treat this as a normal fallback path and can log default scores as if evaluation succeeded.

Proposed fix
+import logging
 import json
 import re
 from typing import Any
 
+logger = logging.getLogger(__name__)
+
 
 def parse_first_json_block(text: str) -> dict[str, Any]:
     json_match = re.search(r"\{[^{}]*\}", text, re.DOTALL)
     if not json_match:
         return {}
     try:
         return json.loads(json_match.group())
     except json.JSONDecodeError:
+        logger.warning(
+            "Failed to decode JSON block from evaluator output",
+            extra={"payload_preview": text[:200]},
+        )
         return {}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/d4bl/services/langfuse/parsers.py` around lines 18 - 25, The current
parse_first_json_block function swallows json.JSONDecodeError and returns {}
which is indistinguishable from "no JSON found"; update parse_first_json_block
to not silently collapse parse failures—when json.loads(json_match.group())
raises json.JSONDecodeError, capture it and either re-raise an exception
(preserving the original JSONDecodeError) or raise a new ValueError that
includes the error message and the offending json_match.group() text so callers
can distinguish parse errors from "no JSON found" and handle/log them
appropriately.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/d4bl/infra/vector_store.py`:
- Line 24: Update the constructor signature to include an explicit return type
annotation; change the __init__ method declaration (the one accepting
ollama_base_url and embedder_model) to annotate its return as -> None so it
reads as a proper constructor with a None return type in the vector store class.

In `@src/d4bl/services/langfuse/parsers.py`:
- Around line 76-78: Add a docstring to the public function parse_label_score
that clearly documents the parameters (text: the input string to parse, mapping:
a dict[str, float] mapping label names to scores, default_score: float used when
no score can be determined), the return value (tuple[float, str] where the first
element is the numeric score and the second is the parsed label), and the
function behavior (how labels/scores are extracted from text, how mapping is
consulted and when default_score is returned if parsing fails). Include expected
types, any edge-case behavior (e.g., empty text or unknown label), and a short
usage example or note about return guarantees.

---

Outside diff comments:
In `@src/d4bl/services/langfuse/parsers.py`:
- Around line 18-25: The current parse_first_json_block function swallows
json.JSONDecodeError and returns {} which is indistinguishable from "no JSON
found"; update parse_first_json_block to not silently collapse parse
failures—when json.loads(json_match.group()) raises json.JSONDecodeError,
capture it and either re-raise an exception (preserving the original
JSONDecodeError) or raise a new ValueError that includes the error message and
the offending json_match.group() text so callers can distinguish parse errors
from "no JSON found" and handle/log them appropriately.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7866e901-dc3c-4572-97ef-2e375ea410d2

📥 Commits

Reviewing files that changed from the base of the PR and between 85426f6 and e7574d9.

📒 Files selected for processing (7)
  • src/d4bl/agents/tools/crawl_tools/crawl4ai.py
  • src/d4bl/app/schemas.py
  • src/d4bl/infra/vector_store.py
  • src/d4bl/query/fusion.py
  • src/d4bl/query/parser.py
  • src/d4bl/services/langfuse/parsers.py
  • src/d4bl/services/research_runner.py

Comment thread src/d4bl/infra/vector_store.py Outdated
Comment thread src/d4bl/services/langfuse/parsers.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/d4bl/infra/vector_store.py (1)

24-24: ⚠️ Potential issue | 🟡 Minor

Line 24 exceeds the 100-character limit.

Please wrap the constructor signature to satisfy the repository’s Python line-length rule.

Suggested fix
-    def __init__(self, ollama_base_url: str | None = None, embedder_model: str = "mxbai-embed-large") -> None:
+    def __init__(
+        self,
+        ollama_base_url: str | None = None,
+        embedder_model: str = "mxbai-embed-large",
+    ) -> None:

As per coding guidelines **/*.py: Enforce 100 character maximum line length in Python code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/d4bl/infra/vector_store.py` at line 24, The __init__ constructor
signature in src/d4bl/infra/vector_store.py exceeds 100 chars; break the long
def __init__(...) line into multiple lines to comply with the 100-character rule
by placing parameters on their own indented lines (e.g., split ollama_base_url:
str | None = None and embedder_model: str = "mxbai-embed-large" onto separate
lines) while preserving type annotations and the return annotation -> None so
the method signature and meaning of __init__ remain unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/d4bl/infra/vector_store.py`:
- Around line 277-278: The code currently uses truthy checks on the limit
parameter (e.g., "if limit") which treats limit=0 as "no limit" and can return
unbounded rows; update all checks that gate adding a LIMIT or slicing by the
parameter to use explicit None checks (replace "if limit" with "if limit is not
None" and "if not limit" with "if limit is None") so that a numeric 0 is honored
as a valid value in the function signature returning list[dict[str, Any]];
ensure the query-building / slicing logic only omits the LIMIT when limit is
None.

---

Duplicate comments:
In `@src/d4bl/infra/vector_store.py`:
- Line 24: The __init__ constructor signature in src/d4bl/infra/vector_store.py
exceeds 100 chars; break the long def __init__(...) line into multiple lines to
comply with the 100-character rule by placing parameters on their own indented
lines (e.g., split ollama_base_url: str | None = None and embedder_model: str =
"mxbai-embed-large" onto separate lines) while preserving type annotations and
the return annotation -> None so the method signature and meaning of __init__
remain unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d9022e11-95cd-4bb5-9b88-806b8ce7357f

📥 Commits

Reviewing files that changed from the base of the PR and between e7574d9 and 46b7e27.

📒 Files selected for processing (1)
  • src/d4bl/infra/vector_store.py

Comment thread src/d4bl/infra/vector_store.py
`if limit` treats 0 as falsy, skipping the LIMIT clause entirely.
Use `if limit is not None` so limit=0 is honored correctly.
@William-Hill William-Hill merged commit ad06569 into main Mar 5, 2026
@William-Hill William-Hill deleted the refactor/modernize-type-hints branch March 5, 2026 18:00
@coderabbitai coderabbitai Bot mentioned this pull request Mar 5, 2026
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

refactor: Modernize type hints across codebase

1 participant