refactor: Modernize type hints across codebase (#31)#45
Conversation
Replace deprecated typing module imports with modern Python 3.10+ equivalents: - Dict → dict, List → list, Optional[X] → X | None, Union[X, Y] → X | Y - Callable → collections.abc.Callable, AsyncGenerator → collections.abc.AsyncGenerator - Add `from __future__ import annotations` where missing - 26 files updated, zero behavior change, all 121 tests pass
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughModernizes type hints across the codebase by replacing typing module aliases with Python 3.10+ built-ins (e.g., Changes
Sequence Diagram(s)(omitted) Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
src/d4bl/agents/tools/crawl_tools/pdf_extraction.py (3)
155-156:⚠️ Potential issue | 🟠 MajorEscape extracted text before embedding it into HTML fields.
Raw PDF text can contain markup-like content; embedding it directly into
cleaned_html/htmlcan propagate executable HTML/JS downstream.Proposed fix
+import html @@ - return { + preview = html.escape(extracted_text[:1000], quote=True) + return { "url": url, "extracted_content": extracted_text, - "cleaned_html": f"<html><body><pre>{extracted_text[:1000]}...</pre></body></html>", - "html": f"<html><body><pre>{extracted_text[:1000]}...</pre></body></html>", + "cleaned_html": f"<html><body><pre>{preview}...</pre></body></html>", + "html": f"<html><body><pre>{preview}...</pre></body></html>", "metadata": metadata,🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/d4bl/agents/tools/crawl_tools/pdf_extraction.py` around lines 155 - 156, Sanitize extracted_text before embedding into the cleaned_html and html fields to prevent injected HTML/JS: in the code that builds the dict entries "cleaned_html" and "html" (where they currently use f"<html><body><pre>{extracted_text[:1000]}...</pre></body></html>"), replace raw insertion with an escaped version of extracted_text (e.g., call a utility like html.escape or an existing escape_html function on extracted_text[:1000]) and then interpolate the escaped string so the pre block cannot render markup or scripts.
89-89:⚠️ Potential issue | 🟡 MinorSplit the User-Agent string to satisfy max line length.
Line 89 appears to exceed the 100-character Python line-length rule.
Proposed fix
- 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36', + "User-Agent": ( + "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " + "AppleWebKit/537.36 (KHTML, like Gecko) " + "Chrome/91.0.4472.124 Safari/537.36" + ),As per coding guidelines:
**/*.pymust enforce a 100 character maximum line length in Python code.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/d4bl/agents/tools/crawl_tools/pdf_extraction.py` at line 89, The 'User-Agent' header string in the headers dict (in src/d4bl/agents/tools/crawl_tools/pdf_extraction.py) exceeds the 100-char line-length rule; break it into multiple shorter string literals (e.g., use implicit adjacent string literals or concatenate with + inside the same expression) for the 'User-Agent' value so the line length is under 100 chars while preserving the exact header value (reference the headers dict and the 'User-Agent' key to locate the change).
94-112:⚠️ Potential issue | 🟠 MajorEnforce maximum PDF download size and use chunked streaming.
stream=Trueis negated byresponse.content, which buffers the entire response in memory. Without a size limit, this enables resource-exhaustion attacks (e.g., downloading multi-GB files). Additionally, the response object is not managed as a context manager, risking unclosed connections.Implement chunked reading with
response.iter_content()and enforce a maximum download size before writing to disk.Proposed fix
- response = requests.get( - url, - timeout=timeout * 2, # PDFs may take longer - stream=True, - headers=headers, - allow_redirects=True - ) - response.raise_for_status() - - # Check content type - content_type = response.headers.get('Content-Type', '').lower() - if 'pdf' not in content_type and not url.lower().endswith('.pdf'): - logger.warning("URL does not appear to be a PDF (Content-Type: %s)", content_type) - # Continue anyway, might still be a PDF - - # Save to temporary file - with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp_file: - tmp_file.write(response.content) - tmp_path = tmp_file.name + max_pdf_bytes = 25 * 1024 * 1024 # 25MB safety cap + downloaded = 0 + with requests.get( + url, + timeout=timeout * 2, # PDFs may take longer + stream=True, + headers=headers, + allow_redirects=True, + ) as response: + response.raise_for_status() + + content_type = response.headers.get('Content-Type', '').lower() + if 'pdf' not in content_type and not url.lower().endswith('.pdf'): + logger.warning("URL does not appear to be a PDF (Content-Type: %s)", content_type) + + with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp_file: + for chunk in response.iter_content(chunk_size=8192): + if not chunk: + continue + downloaded += len(chunk) + if downloaded > max_pdf_bytes: + raise ValueError(f"PDF exceeds max allowed size: {max_pdf_bytes} bytes") + tmp_file.write(chunk) + tmp_path = tmp_file.name🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/d4bl/agents/tools/crawl_tools/pdf_extraction.py` around lines 94 - 112, Replace the current full-buffer download (use of response.content) with chunked streaming under a context manager: open the response with "with requests.get(..., stream=True) as response:", call response.raise_for_status(), then write response.iter_content(chunk_size=8192) into the tempfile.NamedTemporaryFile while tracking a running byte counter and aborting (closing response, deleting partial tmp file and raising/logging) if the total exceeds a configured max_download_bytes (e.g., MAX_PDF_BYTES). Ensure you still check Content-Type and URL suffix as before, set tmp_path from tmp_file.name after successful completion, and use response.iter_content to avoid buffering the whole response and to properly close the connection.src/d4bl/services/langfuse/hallucination.py (1)
16-24: 🛠️ Refactor suggestion | 🟠 MajorAdd a docstring for
evaluate_hallucination.This public API is changed in this PR but still has no function-level docstring.
📝 Proposed fix
def evaluate_hallucination( query: str, answer: str, context: str, trace_id: str | None = None, llm: Any = None, langfuse: Any = None, ) -> dict[str, Any]: + """Evaluate whether the answer is grounded in the provided context."""As per coding guidelines
**/*.py: Include docstrings for public APIs in Python.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/d4bl/services/langfuse/hallucination.py` around lines 16 - 24, Add a clear docstring to the public function evaluate_hallucination describing its purpose (evaluate whether an LLM answer hallucinates given a query and context), list and type each parameter (query: str, answer: str, context: str, trace_id: Optional[str], llm: Any, langfuse: Any), explain the returned value shape (dict[str, Any] and what keys/values represent), and note side effects (e.g., may emit logs/events to langfuse or use the llm client) and any exceptions raised; place the docstring immediately inside the evaluate_hallucination function definition so tools and linters recognize it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/d4bl/agents/tools/crawl_tools/crawl4ai.py`:
- Line 223: The method _handle_pdfs_separately(pdf_urls: list[str], query: str)
declares an unused parameter query which should be removed to fix ARG002; change
the signature to _handle_pdfs_separately(pdf_urls: list[str]) and update the
return type if necessary, then find and update all call sites that pass a query
(e.g., any usages inside this module or tests) to call
_handle_pdfs_separately(pdf_urls) instead, and adjust any references or type
hints to match the new signature.
- Line 11: Remove the deprecated typing.Type import and update the annotation
that uses it: delete the "from typing import Type" import and change the
args_schema annotation (the one currently typed with Type[...] in this module,
e.g., the args_schema variable or parameter at line ~67) to use the built-in
generic type syntax "type[...]" instead; ensure any other occurrences in
crawl4ai.py are similarly replaced and run type checks to confirm no remaining
references to typing.Type remain.
In `@src/d4bl/app/schemas.py`:
- Line 14: The line declaring selected_agents currently exceeds the 100-char
limit; refactor the declaration in schemas.py by moving the inline comment to
its own preceding or trailing line and/or shortening the comment so the
declaration line stays under 100 chars; specifically modify the selected_agents:
list[str] | None = None declaration and its comment (e.g., split the comment
"List of agent names to run (e.g., [\"researcher\", \"writer\"])" onto a
separate line or shorten it) so the combined lines conform to the 100-character
max while preserving the same explanatory text.
In `@src/d4bl/infra/vector_store.py`:
- Around line 211-214: engine.py currently passes job_id: str | None into
search_similar which expects UUID | None; fix by converting the string to a UUID
before calling vector_store.search_similar. Locate where job_id is
declared/received in src/d4bl/query/engine.py (the variable named job_id) and
call the existing parse_job_uuid(job_id) (same helper used in api.py) to produce
a UUID | None, then pass that result into search_similar; alternatively, if you
prefer the other approach, update the search_similar signature in
vector_store.py to accept job_id: str | UUID | None and adjust internal handling
accordingly so type contracts align.
In `@src/d4bl/query/fusion.py`:
- Line 52: Update the constructor signature to include an explicit return
annotation "-> None" for the __init__ method in fusion.py: locate the def
__init__(self, ollama_base_url: str | None = None) definition and change it to
declare "-> None" so the constructor is properly annotated per PEP 8/type-hint
guidelines.
In `@src/d4bl/query/parser.py`:
- Line 45: The __init__ constructor in the Parser class (def __init__(self,
ollama_base_url: str | None = None)) is missing an explicit return type; update
the signature of __init__ to include the return annotation "-> None" (i.e., def
__init__(self, ollama_base_url: str | None = None) -> None:) so it complies with
the project's typing/coding guidelines.
In `@src/d4bl/services/langfuse/content_relevance.py`:
- Around line 18-25: Add a clear docstring to the public function
evaluate_content_relevance describing its purpose, parameters, and return value:
state that it evaluates relevance of extracted_contents (list[dict]) against a
query (str), explain optional parameters trace_id (str|None) for tracing, llm
and langfuse (Any) for optional integrations, and describe the returned
dict[str, Any] shape (e.g., relevance scores, metadata, timing). Keep it concise
and include behavior/side-effects (uses time measurement, may call llm/langfuse)
and any exceptions or edge cases.
In `@src/d4bl/services/langfuse/parsers.py`:
- Line 76: The function signature for parse_label_score is longer than 100
chars; break it across multiple lines so it conforms to the 100-char limit — for
example, put the parameters (text: str, mapping: dict[str, float],
default_score: float = 3.0) each on their own line and/or move the return
annotation tuple[float, str] to a new line; update the signature in the
parse_label_score definition accordingly so the function name and all type
annotations remain unchanged.
In `@src/d4bl/services/langfuse/runner.py`:
- Around line 60-67: Add a clear docstring to the public function
run_comprehensive_evaluation that briefly describes its purpose and behavior,
documents each parameter (query, research_output, sources, trace_id,
extracted_contents, report) with expected types and meanings, specifies the
return type (dict[str, Any]) and the structure of the returned data, and notes
any exceptions or side-effects (e.g., logging, external calls) so callers know
how to use it; place the docstring immediately below the def
run_comprehensive_evaluation(...) line in standard triple-quoted format.
In `@src/d4bl/services/research_runner.py`:
- Line 144: The parameter annotation for logs is too broad; find the function
signatures containing "logs: list | None = None" (the two occurrences in this
file, one at the shown signature and the other later) and tighten them to "logs:
list[str] | None = None" so the pipeline clearly documents that log entries are
strings; update both occurrences, run type checks, and adjust any related
docstrings or return annotations if they assume a specific list element type.
---
Outside diff comments:
In `@src/d4bl/agents/tools/crawl_tools/pdf_extraction.py`:
- Around line 155-156: Sanitize extracted_text before embedding into the
cleaned_html and html fields to prevent injected HTML/JS: in the code that
builds the dict entries "cleaned_html" and "html" (where they currently use
f"<html><body><pre>{extracted_text[:1000]}...</pre></body></html>"), replace raw
insertion with an escaped version of extracted_text (e.g., call a utility like
html.escape or an existing escape_html function on extracted_text[:1000]) and
then interpolate the escaped string so the pre block cannot render markup or
scripts.
- Line 89: The 'User-Agent' header string in the headers dict (in
src/d4bl/agents/tools/crawl_tools/pdf_extraction.py) exceeds the 100-char
line-length rule; break it into multiple shorter string literals (e.g., use
implicit adjacent string literals or concatenate with + inside the same
expression) for the 'User-Agent' value so the line length is under 100 chars
while preserving the exact header value (reference the headers dict and the
'User-Agent' key to locate the change).
- Around line 94-112: Replace the current full-buffer download (use of
response.content) with chunked streaming under a context manager: open the
response with "with requests.get(..., stream=True) as response:", call
response.raise_for_status(), then write response.iter_content(chunk_size=8192)
into the tempfile.NamedTemporaryFile while tracking a running byte counter and
aborting (closing response, deleting partial tmp file and raising/logging) if
the total exceeds a configured max_download_bytes (e.g., MAX_PDF_BYTES). Ensure
you still check Content-Type and URL suffix as before, set tmp_path from
tmp_file.name after successful completion, and use response.iter_content to
avoid buffering the whole response and to properly close the connection.
In `@src/d4bl/services/langfuse/hallucination.py`:
- Around line 16-24: Add a clear docstring to the public function
evaluate_hallucination describing its purpose (evaluate whether an LLM answer
hallucinates given a query and context), list and type each parameter (query:
str, answer: str, context: str, trace_id: Optional[str], llm: Any, langfuse:
Any), explain the returned value shape (dict[str, Any] and what keys/values
represent), and note side effects (e.g., may emit logs/events to langfuse or use
the llm client) and any exceptions raised; place the docstring immediately
inside the evaluate_hallucination function definition so tools and linters
recognize it.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 20b7c021-8e01-48b3-9e43-0b14a5a10127
📒 Files selected for processing (26)
src/d4bl/agents/crew.pysrc/d4bl/agents/tools/crawl_tools/crawl4ai.pysrc/d4bl/agents/tools/crawl_tools/firecrawl.pysrc/d4bl/agents/tools/crawl_tools/pdf_extraction.pysrc/d4bl/agents/tools/crawl_tools/utils.pysrc/d4bl/app/api.pysrc/d4bl/app/schemas.pysrc/d4bl/app/websocket_manager.pysrc/d4bl/infra/vector_store.pysrc/d4bl/query/engine.pysrc/d4bl/query/fusion.pysrc/d4bl/query/parser.pysrc/d4bl/query/structured.pysrc/d4bl/services/error_handling.pysrc/d4bl/services/langfuse/_base.pysrc/d4bl/services/langfuse/bias.pysrc/d4bl/services/langfuse/client.pysrc/d4bl/services/langfuse/content_relevance.pysrc/d4bl/services/langfuse/hallucination.pysrc/d4bl/services/langfuse/parsers.pysrc/d4bl/services/langfuse/quality.pysrc/d4bl/services/langfuse/reference.pysrc/d4bl/services/langfuse/report_relevance.pysrc/d4bl/services/langfuse/runner.pysrc/d4bl/services/langfuse/source_relevance.pysrc/d4bl/services/research_runner.py
- Replace `typing.Type` with builtin `type[]` in crawl4ai.py - Remove unused `query` param from `_handle_pdfs_separately` - Move inline comment to own line in schemas.py (100-char limit) - Widen `search_similar` job_id to `UUID | str | None` for engine.py compat - Add `-> None` return annotations to `__init__` in fusion.py and parser.py - Split long function signature in parsers.py - Tighten `logs: list | None` to `list[str] | None` in research_runner.py
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/d4bl/services/langfuse/parsers.py (1)
18-25:⚠️ Potential issue | 🟠 MajorDo not silently collapse JSON parse failures into normal fallback output.
Line [24]-[25] returns
{}forJSONDecodeError, which is indistinguishable from “no JSON found.”
Downstream callers then treat this as a normal fallback path and can log default scores as if evaluation succeeded.Proposed fix
+import logging import json import re from typing import Any +logger = logging.getLogger(__name__) + def parse_first_json_block(text: str) -> dict[str, Any]: json_match = re.search(r"\{[^{}]*\}", text, re.DOTALL) if not json_match: return {} try: return json.loads(json_match.group()) except json.JSONDecodeError: + logger.warning( + "Failed to decode JSON block from evaluator output", + extra={"payload_preview": text[:200]}, + ) return {}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/d4bl/services/langfuse/parsers.py` around lines 18 - 25, The current parse_first_json_block function swallows json.JSONDecodeError and returns {} which is indistinguishable from "no JSON found"; update parse_first_json_block to not silently collapse parse failures—when json.loads(json_match.group()) raises json.JSONDecodeError, capture it and either re-raise an exception (preserving the original JSONDecodeError) or raise a new ValueError that includes the error message and the offending json_match.group() text so callers can distinguish parse errors from "no JSON found" and handle/log them appropriately.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/d4bl/infra/vector_store.py`:
- Line 24: Update the constructor signature to include an explicit return type
annotation; change the __init__ method declaration (the one accepting
ollama_base_url and embedder_model) to annotate its return as -> None so it
reads as a proper constructor with a None return type in the vector store class.
In `@src/d4bl/services/langfuse/parsers.py`:
- Around line 76-78: Add a docstring to the public function parse_label_score
that clearly documents the parameters (text: the input string to parse, mapping:
a dict[str, float] mapping label names to scores, default_score: float used when
no score can be determined), the return value (tuple[float, str] where the first
element is the numeric score and the second is the parsed label), and the
function behavior (how labels/scores are extracted from text, how mapping is
consulted and when default_score is returned if parsing fails). Include expected
types, any edge-case behavior (e.g., empty text or unknown label), and a short
usage example or note about return guarantees.
---
Outside diff comments:
In `@src/d4bl/services/langfuse/parsers.py`:
- Around line 18-25: The current parse_first_json_block function swallows
json.JSONDecodeError and returns {} which is indistinguishable from "no JSON
found"; update parse_first_json_block to not silently collapse parse
failures—when json.loads(json_match.group()) raises json.JSONDecodeError,
capture it and either re-raise an exception (preserving the original
JSONDecodeError) or raise a new ValueError that includes the error message and
the offending json_match.group() text so callers can distinguish parse errors
from "no JSON found" and handle/log them appropriately.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7866e901-dc3c-4572-97ef-2e375ea410d2
📒 Files selected for processing (7)
src/d4bl/agents/tools/crawl_tools/crawl4ai.pysrc/d4bl/app/schemas.pysrc/d4bl/infra/vector_store.pysrc/d4bl/query/fusion.pysrc/d4bl/query/parser.pysrc/d4bl/services/langfuse/parsers.pysrc/d4bl/services/research_runner.py
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/d4bl/infra/vector_store.py (1)
24-24:⚠️ Potential issue | 🟡 MinorLine 24 exceeds the 100-character limit.
Please wrap the constructor signature to satisfy the repository’s Python line-length rule.
Suggested fix
- def __init__(self, ollama_base_url: str | None = None, embedder_model: str = "mxbai-embed-large") -> None: + def __init__( + self, + ollama_base_url: str | None = None, + embedder_model: str = "mxbai-embed-large", + ) -> None:As per coding guidelines
**/*.py: Enforce 100 character maximum line length in Python code.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/d4bl/infra/vector_store.py` at line 24, The __init__ constructor signature in src/d4bl/infra/vector_store.py exceeds 100 chars; break the long def __init__(...) line into multiple lines to comply with the 100-character rule by placing parameters on their own indented lines (e.g., split ollama_base_url: str | None = None and embedder_model: str = "mxbai-embed-large" onto separate lines) while preserving type annotations and the return annotation -> None so the method signature and meaning of __init__ remain unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/d4bl/infra/vector_store.py`:
- Around line 277-278: The code currently uses truthy checks on the limit
parameter (e.g., "if limit") which treats limit=0 as "no limit" and can return
unbounded rows; update all checks that gate adding a LIMIT or slicing by the
parameter to use explicit None checks (replace "if limit" with "if limit is not
None" and "if not limit" with "if limit is None") so that a numeric 0 is honored
as a valid value in the function signature returning list[dict[str, Any]];
ensure the query-building / slicing logic only omits the LIMIT when limit is
None.
---
Duplicate comments:
In `@src/d4bl/infra/vector_store.py`:
- Line 24: The __init__ constructor signature in src/d4bl/infra/vector_store.py
exceeds 100 chars; break the long def __init__(...) line into multiple lines to
comply with the 100-character rule by placing parameters on their own indented
lines (e.g., split ollama_base_url: str | None = None and embedder_model: str =
"mxbai-embed-large" onto separate lines) while preserving type annotations and
the return annotation -> None so the method signature and meaning of __init__
remain unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: d9022e11-95cd-4bb5-9b88-806b8ce7357f
📒 Files selected for processing (1)
src/d4bl/infra/vector_store.py
`if limit` treats 0 as falsy, skipping the LIMIT clause entirely. Use `if limit is not None` so limit=0 is honored correctly.
Summary
typingmodule imports with modern Python 3.10+ built-in equivalents across 26 filesDict→dict,List→list,Optional[X]→X | None,Union[X, Y]→X | YCallable→collections.abc.Callable,AsyncGenerator→collections.abc.AsyncGeneratorfrom __future__ import annotationsto 7 files that were missing itAny,TypeVar,Literal, andTyperemain intypingimports (not deprecated)Closes #31
Test plan
Optional[,Dict[,List[,Union[annotations remainingSummary by CodeRabbit
Refactor
New Features
Chores