jamesbraza
Collaborator

Just DRYing up test_chunk_metadata_reader and confirming that read_doc behaves as expected.

@jamesbraza jamesbraza self-assigned this Jul 2, 2025
@jamesbraza jamesbraza added the enhancement New feature or request label Jul 2, 2025
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Jul 2, 2025
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR refactors test_chunk_metadata_reader to remove hard-coded chunk size values by using metadata.chunk_metadata.chunk_chars dynamically, DRYs up length and count assertions across tests, and adds new checks to verify that chunk names reflect correct page ranges.

  • Replaced hard-coded 3000 chunk size in assertions with metadata.chunk_metadata.chunk_chars
  • Added assertions that parse and validate page ranges from chunk.text.name
  • Updated overlap tests for HTML and code inputs to use dynamic metadata values
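
A minimal sketch of the page-range check described in the bullets above, assuming a chunk name ends in a space-separated "start-end" page token. The helper name parse_page_range and the sample chunk name are hypothetical and not taken from the PR.

```python
def parse_page_range(name: str) -> tuple[int, int]:
    """Parse the trailing 'start-end' page token of a chunk name (assumed format)."""
    # Take the last space-separated token, e.g. "3-4", then split on the hyphen.
    start, end = name.rsplit(" ", maxsplit=1)[-1].split("-")
    return int(start), int(end)


# Hypothetical chunk name; in the test, real names come from chunk.text.name.
first_page, last_page = parse_page_range("Doe2024 pages 3-4")
assert first_page <= last_page
```

Returning integers rather than strings lets a test compare pages numerically, which avoids surprises such as "10" sorting before "9".
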

Comments suppressed due to low confidence (3)

tests/test_paperqa.py:1170

  • [nitpick] The variable name stlast_page is unclear. Consider renaming it to something like start_last_page or last_chunk_start_page for better readability.
    stlast_page, last_page = chunk_text[-1].name.rsplit(" ", maxsplit=1)[-1].split("-")

tests/test_paperqa.py:1193

  • [nitpick] This assertion lacks a custom error message. Adding a descriptive message will make failures easier to diagnose (e.g., include expected vs actual length conditions).
    assert all(

tests/test_paperqa.py:1217

  • [nitpick] Consider adding a custom failure message to this assertion so that if it fails in the code-based chunk test, the output clearly indicates the intended length constraint.
        assert all(
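
The two nitpicks above ask for descriptive assertion messages. One possible shape is sketched below with hypothetical names (check_chunk_lengths, texts), assuming the intended constraint is that no chunk exceeds chunk_chars characters. The PR's actual condition is truncated here and may differ.

```python
def check_chunk_lengths(texts: list[str], chunk_chars: int) -> None:
    """Assert no text exceeds chunk_chars, reporting offenders on failure."""
    # Collect index -> length for every oversize chunk so the failure
    # message shows expected vs. actual sizes, not just 'assert False'.
    too_long = {i: len(t) for i, t in enumerate(texts) if len(t) > chunk_chars}
    assert not too_long, (
        f"Expected every chunk to have at most {chunk_chars} characters,"
        f" but found oversize chunks (index: length): {too_long}"
    )


# Hypothetical data; in the test, chunk_chars would come from
# metadata.chunk_metadata.chunk_chars rather than a literal.
check_chunk_lengths(["abc", "de"], chunk_chars=3)
```

Collecting the offenders first, instead of wrapping a generator in all(...), keeps the message specific about which chunks failed.
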

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jul 2, 2025
@jamesbraza jamesbraza merged commit 1668c12 into main Jul 2, 2025
5 checks passed
@jamesbraza jamesbraza deleted the better-chunking-assertions branch July 2, 2025 21:44