-
-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Issue #2959: support appending to existing PDFs #2965
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…remnants of text writing from PdfImagePlugin
src/PIL/pdfParser.py
Outdated
| UserDict = collections.UserDict | ||
|
|
||
|
|
||
| if sys.version_info.major >= 3: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The usual check in this codebase is if str == bytes:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
src/PIL/pdfParser.py
Outdated
| return pages | ||
|
|
||
|
|
||
| def selftest(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of these should be in the Tests directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Done.
src/PIL/pdfParser.py
Outdated
| # XXX TODO delete Pages tree recursively | ||
|
|
||
| def read_pdf_info_from_file(self, f): | ||
| self.buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is going to fail if a io.BytesIO object is passed in as the file like object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I'm not sure how relevant that is given that the entire idea is to append to existing files, but I suppose that doesn't preclude having the file in memory for some reason. Fixed now.
src/PIL/pdfParser.py
Outdated
| self.read_pdf_info_from_file(f) | ||
| elif filename is not None: | ||
| with open(filename, "rb") as f: | ||
| self.read_pdf_info_from_file(f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little uncomfortable here with self.buf as local state and not something explicitly passed into read_pdf_info, especially as there is at least one other value that may be set here, but not in the else case below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand. Do you mean to eliminate self.buf completely and just pass buf around in all the calls to the methods that read and parse the PDF?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. In general, I prefer to see pure functions. If that's not feasable, a set of object attributes that are consistent over the life of the object. In this case, there's a function that (silently) requires self.buf, and self.buf is either null or not, depending on side effects in other functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now changed the interface so that the parser keeps the file open and keeps a reference to it, making an open-read-append use case much easier. I have also eliminated all the passing of the file and buffer objects.
Codecov Report
@@ Coverage Diff @@
## master #2965 +/- ##
==========================================
+ Coverage 81.07% 83.66% +2.59%
==========================================
Files 167 168 +1
Lines 22605 23509 +904
Branches 2793 2793
==========================================
+ Hits 18326 19669 +1343
+ Misses 4279 3840 -439
Continue to review full report at Codecov.
|
…thods to support writing, eliminate the passing of file or buffer
|
As far as I can tell, the alleged coverage decrease reported by Coveralls seems to be caused by coverage reporting apparently not working on the Python 3.7 build. In fact, coverage should have increased with this patch. |
|
I wouldn't worry too much about Coveralls, there's also an issue about it here: #2934. We can look at Codecov instead. |
|
So... merge please? :) |
Tests/test_file_pdf.py
Outdated
| pdf.info.Keywords = "qw)e\\r(ty" | ||
| pdf.info.Creator = "hopper()" | ||
| pdf.start_writing() | ||
| pdf.write_xref_and_trailer(f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Undefined name 'f', this test would fail if it was run
Tests/test_file_pdf.py
Outdated
| self.assertEqual(pdf.info.Keywords, "qw)e\\r(ty") | ||
| self.assertEqual(pdf.info.Subject, u"ghi\uABCD") | ||
|
|
||
| def test_pdf_append(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Redefinition of test_pdf_append(), so only one of these is run
| def test_pdf_open(self): | ||
| # fail on a buffer full of null bytes | ||
| self.assertRaises(pdfParser.PdfFormatError, pdfParser.PdfParser, buf=bytearray(65536)) | ||
| # make an empty PDF object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a newline before these comments to space things out a bit
| def test_pdf_append(self): | ||
| # make a PDF file | ||
| pdf_filename = self.helper_save_as_pdf("RGB", producer="pdfParser") | ||
| # open it, check pages and info |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Newlines before comments to help readability
Tests/test_pdfparser.py
Outdated
| @@ -0,0 +1,89 @@ | |||
| from helper import unittest, PillowTestCase | |||
|
|
|||
| from PIL.pdfParser import * | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's replace this star import with the specific imports needed
| nesting_depth -= 1 | ||
| offset = m.end() | ||
| raise PdfFormatError("unfinished literal string") | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove one newline
src/PIL/pdfParser.py
Outdated
| if m.group(1): | ||
| result.extend(klass.escaped_chars[m.group(1)[1]]) | ||
| elif m.group(2): | ||
| #result.append(eval(m.group(1))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this commented line be removed?
| re_xref_section_start = re.compile(whitespace_optional + br"xref" + newline) | ||
| re_xref_subsection_start = re.compile(whitespace_optional + br"([0-9]+)" + whitespace_mandatory + br"([0-9]+)" + whitespace_optional + newline_only) | ||
| re_xref_entry = re.compile(br"([0-9]{10}) ([0-9]{5}) ([fn])( \r| \n|\r\n)") | ||
| def read_xref_table(self, xref_section_offset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Newline before this
src/PIL/pdfParser.py
Outdated
| self.xref_table[i] = new_entry | ||
| return offset | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove a newline
| assert generation == ref[1] | ||
| return self.get_value(self.buf, offset + self.start_offset, expect_indirect=IndirectReference(*ref), max_nesting=max_nesting)[0] | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove a newline
|
Thanks for the comprehensive review! Everything should be done now. |
|
Is there anything I can do to help/encourage this forward? Thanks for your time. |
|
Implement a 29 hour day? I'm backed up with paid stuff now. I've been hoping to get to PR review, but it just hasn't happened this month. |
wiredfool
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM once the asserts are converted to exceptions
| for imSequence in ims: | ||
| for im in ImageSequence.Iterator(imSequence): | ||
| # FIXME: Should replace ASCIIHexDecode with RunLengthDecode (packbits) | ||
| # or LZWDecode (tiff/lzw compression). Note that PDF 1.2 also supports |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note that I don't trust any of the pillow internal packbits or LZW compression methods to be correct, as they appear to lead to corrupt images in some of the tiff tests. We've been moving off of them in favor of libtiff.
Flatedecode should be ok, as that's a a library function already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, thanks for the heads up. This is just moving around code that was already there, though, so I don't feel like doing anything about it in this pull request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough.
src/PIL/PdfParser.py
Outdated
| offset = m.end() | ||
| m = klass.re_indirect_def_start.match(data, offset) | ||
| if m: | ||
| assert int(m.group(1)) > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please raise an exception rather than assert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done (and elsewhere).
|
Just discovered a bug in writing a Page dict's Parent. Gimme a few minutes... |
|
OK, that was a bit more than a few minutes but all is done now. Thanks for merging! |
| self.assertEqual(pdf.info.Title, "abc") | ||
| self.check_pdf_pages_consistency(pdf) | ||
|
|
||
| # append two images |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is incorrect, yes? Only one image is being appended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, no, it is appending the mode_CMYK and mode_P images, and asserting that len(pdf.pages) is 3 a few lines below (where a few lines above it was 1). Am I missing something?
Merge please? ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, I see. Thanks.
|
|
||
| **creator** | ||
| If the document was converted to PDF from another format, the name of the | ||
| conforming product that created the original document from which it was |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "conforming" mean here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a direct quote from the spec, see http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
And here's another direct quote, chapter 2.4 Conforming products:
A conforming product shall comply with all requirements regarding the creation of PDF files as specified in ISO 32000-1 as well as comply with all requirements regarding reader functional behavior specified in ISO 32000-1.
|
Thanks! |
Fixes #2959.
Changes proposed in this pull request: