Fix: ensure PDFs are treated as multi-page in load_file (fixes #30) #81
Open
PiyushInt wants to merge 1 commit into datalab-to:master from
Summary
This PR improves PDF handling so that multi-page PDFs are always routed through the PDF loader, which returns one image per page. Previously, if filetype.guess failed to recognize a PDF from its header, load_file treated the file as a single image, so only the first page was processed even when the caller expected a batch of pages.
Changes
- Update load_file in chandra/input.py to:
  - Try filetype.guess first.
  - Fall back to the .pdf file extension when filetype.guess returns None.
  - Call load_pdf_images for PDFs, ensuring multi-page behavior.
- Add tests/test_input_loader.py to:
  - Mock filetype.guess to return None.
  - Mock load_pdf_images to simulate a 3-page PDF.
  - Check that load_file("dummy.pdf", {"page_range": "0-2"}) returns three "pages" and that the parsed page range [0, 1, 2] is passed through.
Relation to issue #30
Issue #30 reports that only the first image is processed when using batch inputs from a PDF. When load_file misclassifies a PDF as a single image (because filetype.guess fails), the caller only ever gets one image, and therefore only one BatchInputItem and one result. This change makes PDF detection more robust and aligns load_file with the expectation that a .pdf path yields multiple images (one per page), even if header detection fails.
Fixes #30
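For reviewers, here is a minimal sketch of the detection order described in this PR. It is not the code in chandra/input.py. The names is_pdf, parse_page_range and load_file_sketch are hypothetical, and the MIME guesser and both loaders are passed in as plain callables, so the example runs without filetype or a PDF backend. It assumes the guesser returns a MIME string, or None when the header is not recognized.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional


def is_pdf(path: str, guessed_mime: Optional[str]) -> bool:
    """Decide whether a path should go through the PDF loader.

    Header-based detection wins when it produced an answer; the
    file extension is only consulted when detection returned None.
    """
    if guessed_mime is not None:
        return guessed_mime == "application/pdf"
    return Path(path).suffix.lower() == ".pdf"


def parse_page_range(spec: str) -> List[int]:
    """Hypothetical parser: "0-2" -> [0, 1, 2], "0,3-4" -> [0, 3, 4]."""
    pages: List[int] = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(part))
    return pages


def load_file_sketch(
    path: str,
    config: Dict[str, Any],
    guess_mime: Callable[[str], Optional[str]],
    load_pdf: Callable[[str, Optional[List[int]]], List[Any]],
    load_image: Callable[[str], Any],
) -> List[Any]:
    """Return one item per page for PDFs, or a one-item list for images."""
    spec = config.get("page_range")
    pages = parse_page_range(spec) if spec else None
    if is_pdf(path, guess_mime(path)):
        return load_pdf(path, pages)
    return [load_image(path)]


if __name__ == "__main__":
    # Stub loaders standing in for the real PDF and image loaders.
    result = load_file_sketch(
        "dummy.pdf",
        {"page_range": "0-2"},
        guess_mime=lambda p: None,  # simulate failed header detection
        load_pdf=lambda p, pages: [f"page-{i}" for i in (pages or [])],
        load_image=lambda p: "single-image",
    )
    print(result)
```

Injecting the guesser and loaders mirrors what the new test does with mocks: the fallback branch can be exercised without a real PDF on disk.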