Fix: ensure PDFs are treated as multi-page in load_file (fixes #30) by PiyushInt · Pull Request #81 · datalab-to/chandra

PiyushInt · 2026-03-27T14:47:49Z

Summary

This PR improves PDF handling so that multi-page PDFs are always routed through the PDF loader, which returns one image per page. Previously, if filetype.guess failed to recognize a PDF from its header, load_file would treat the file as a single image, so only the first page would ever be processed, even when the caller expected a batch of pages.
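The routing described above can be sketched as follows. This is a minimal illustration, not the code in chandra/input.py: `looks_like_pdf`, `load_file_sketch`, and the injected callables `guess_mime`, `load_pdf_pages`, and `load_single_image` are hypothetical names, and `guess_mime` stands in for header detection by returning a MIME string or None.

```python
from pathlib import Path
from typing import Callable, List, Optional


def looks_like_pdf(path: str, guessed_mime: Optional[str]) -> bool:
    """Trust header detection when it gives an answer; otherwise use the extension."""
    if guessed_mime is not None:
        return guessed_mime == "application/pdf"
    # Header detection returned nothing: fall back to the file extension.
    return Path(path).suffix.lower() == ".pdf"


def load_file_sketch(
    path: str,
    guess_mime: Callable[[str], Optional[str]],   # hypothetical header detector
    load_pdf_pages: Callable[[str], List[str]],   # hypothetical: one item per page
    load_single_image: Callable[[str], str],      # hypothetical: one image
) -> List[str]:
    """Return one item per page for PDFs and a one-item list for plain images."""
    if looks_like_pdf(path, guess_mime(path)):
        return load_pdf_pages(path)
    return [load_single_image(path)]


if __name__ == "__main__":
    pages = load_file_sketch(
        "scan.pdf",
        guess_mime=lambda p: None,  # simulate failed header detection
        load_pdf_pages=lambda p: ["page-0", "page-1", "page-2"],
        load_single_image=lambda p: "image",
    )
    print(len(pages))
```

Without the extension fallback, the `guess_mime` result of None would send `scan.pdf` down the single-image branch, which is the failure mode this PR targets.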

Changes

Update load_file in chandra/input.py to:
- Prefer header-based detection via filetype.guess.
- Fall back to the .pdf file extension when filetype.guess returns None.
- Always call load_pdf_images for PDFs, ensuring multi-page behavior.

Add tests/test_input_loader.py to:
- Mock filetype.guess to return None.
- Mock load_pdf_images to simulate a 3-page PDF.
- Assert that load_file("dummy.pdf", {"page_range": "0-2"}) returns three “pages” and that the parsed page range [0, 1, 2] is passed through.
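The test strategy listed above could look roughly like this. The sketch is self-contained and does not import chandra: `loader` is a stand-in namespace for the patched module attributes, and `parse_page_range` and this `load_file` are simplified, hypothetical versions written only so the mocking pattern can run on its own.

```python
from pathlib import Path
from types import SimpleNamespace
from typing import List, Optional
from unittest import mock


def parse_page_range(spec: str) -> List[int]:
    """Expand a spec such as "0-2" or "0,3-4" into a list of page indices."""
    pages: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(part))
    return pages


# Stand-in for the module-level names the real test would patch (assumption).
loader = SimpleNamespace(
    guess=lambda path: None,
    load_pdf_images=lambda path, page_range: [],
)


def load_file(path: str, config: dict) -> list:
    """Simplified loader: header detection first, then the .pdf extension."""
    kind = loader.guess(path)
    if kind is not None:
        is_pdf = kind.mime == "application/pdf"
    else:
        is_pdf = Path(path).suffix.lower() == ".pdf"
    if not is_pdf:
        raise ValueError("this sketch only covers the PDF branch")
    spec: Optional[str] = config.get("page_range")
    page_range = parse_page_range(spec) if spec else None
    return loader.load_pdf_images(path, page_range)


def test_pdf_extension_fallback() -> list:
    fake_pages = ["page-0", "page-1", "page-2"]
    with mock.patch.object(loader, "guess", return_value=None), \
         mock.patch.object(loader, "load_pdf_images", return_value=fake_pages) as fake_loader:
        pages = load_file("dummy.pdf", {"page_range": "0-2"})
    # All three mocked pages come back, and the parsed range is passed through.
    assert pages == fake_pages
    fake_loader.assert_called_once_with("dummy.pdf", [0, 1, 2])
    return pages


if __name__ == "__main__":
    test_pdf_extension_fallback()
```

Mocking both the detector and the PDF loader keeps the test independent of any real PDF file or rendering backend, so it checks only the routing decision and the page-range hand-off.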

Relation to issue #30

Issue #30 reports that only the first image is processed when using batch inputs from a PDF. When load_file misclassifies a PDF as a single image (because filetype.guess fails), the caller only ever gets one image and therefore only one BatchInputItem and result. This change makes PDF detection more robust and aligns load_file behavior with the expectation that a .pdf path yields multiple images (one per page), even if header detection fails.

Fixes #30

Fix: ensure PDFs are treated as multi-page in load_file (fixes datala…

551873a

…b-to#30)

PiyushInt wants to merge 1 commit into datalab-to:master from PiyushInt:fix-batch-pdf-inputs
