Fix: ensure PDFs are treated as multi-page in load_file (fixes #30) #81

Open
PiyushInt wants to merge 1 commit into datalab-to:master from PiyushInt:fix-batch-pdf-inputs

@PiyushInt

Summary

This PR improves PDF handling so that multi-page PDFs are always routed through the PDF loader, which returns one image per page. Previously, if filetype.guess failed to recognize a PDF from its header, load_file would treat the file as a single image, so only the first page would ever be processed, even when the caller expected a batch of pages.

Changes

  • Update load_file in chandra/input.py to:
    • Prefer header-based detection via filetype.guess.
    • Fall back to the .pdf file extension when filetype.guess returns None.
    • Always call load_pdf_images for PDFs, ensuring multi-page behavior.
  • Add tests/test_input_loader.py to:
    • Mock filetype.guess to return None.
    • Mock load_pdf_images to simulate a 3-page PDF.
    • Assert that load_file("dummy.pdf", {"page_range": "0-2"}) returns three “pages” and that the parsed page range [0, 1, 2] is passed through.
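The detection order and the test described above can be sketched roughly as follows. This is a minimal illustration, not the code in chandra/input.py: `load_file_sketch`, `looks_like_pdf`, and `parse_page_range` are hypothetical names, and the header detector and loaders are passed in as plain callables (standing in for filetype.guess, load_pdf_images, and the single-image loader) so the example runs without third-party packages.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional


def parse_page_range(spec: str) -> List[int]:
    """Expand a spec such as "0-2" or "0,3-4" into page indices (hypothetical helper)."""
    pages: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(part))
    return pages


def looks_like_pdf(path: str, guess_mime: Callable[[str], Optional[str]]) -> bool:
    """Prefer header-based detection; fall back to the extension only if it returns None."""
    mime = guess_mime(path)
    if mime is not None:
        return mime == "application/pdf"
    return Path(path).suffix.lower() == ".pdf"


def load_file_sketch(
    path: str,
    config: Dict[str, Any],
    guess_mime: Callable[[str], Optional[str]],
    load_pdf_images: Callable[[str, Optional[List[int]]], List[Any]],
    load_image: Callable[[str], Any],
) -> List[Any]:
    """Return one item per page for PDFs, or a one-element list for a single image."""
    if looks_like_pdf(path, guess_mime):
        spec = config.get("page_range")
        pages = parse_page_range(spec) if spec else None
        return load_pdf_images(path, pages)
    return [load_image(path)]


# Simulate the scenario from the test: header detection fails on a .pdf path.
seen_ranges: List[Optional[List[int]]] = []


def fake_load_pdf_images(path: str, pages: Optional[List[int]]) -> List[str]:
    seen_ranges.append(pages)  # record what was passed through
    return [f"page-{i}" for i in (pages or [])]


result = load_file_sketch(
    "dummy.pdf",
    {"page_range": "0-2"},
    guess_mime=lambda p: None,  # stands in for a failed header guess
    load_pdf_images=fake_load_pdf_images,
    load_image=lambda p: "single-image",
)
```

Injecting the detector and loaders is only a convenience for this sketch; the test in this PR mocks filetype.guess and load_pdf_images instead, which exercises the same branch.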

Relation to issue #30

Issue #30 reports that only the first image is processed when using batch inputs from a PDF. When load_file misclassifies a PDF as a single image (because filetype.guess fails), the caller only ever gets one image and therefore only one BatchInputItem and result. This change makes PDF detection more robust and aligns load_file behavior with the expectation that a .pdf path yields multiple images (one per page), even if header detection fails.

Fixes #30

Development

Successfully merging this pull request may close these issues.

Only the FIRST image will be processed in batch inputs
