Issue #2959: support appending to existing PDFs #2965

vashek · 2018-01-18T13:36:23Z

Changes proposed in this pull request:

add pdfParser
rework PdfImagePlugin to use pdfParser
add appending support to PdfImagePlugin (with kw arg append=True)
fix the case when multiple images to be written to a PDF have different modes

wiredfool · 2018-01-24T11:22:45Z

src/PIL/pdfParser.py

+    UserDict = collections.UserDict
+
+
+if sys.version_info.major >= 3:


The usual check in this codebase is if str == bytes:
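
For reference, a minimal sketch of the two detection styles discussed here. This is an illustration only, not the code from this pull request; the variable name is_py2 is made up.

```python
import collections
import sys

# On Python 2, `str` is the byte-string type, so `str == bytes` is True.
# On Python 3, `str` is text and `bytes` is a separate type, so it is False.
is_py2 = str == bytes

if is_py2:
    # Python 2 keeps UserDict in its own module.
    from UserDict import UserDict
else:
    UserDict = collections.UserDict
```

Both checks should agree on any given interpreter; the type comparison simply avoids inspecting version numbers.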

wiredfool · 2018-01-24T11:28:09Z

src/PIL/pdfParser.py

+        return pages
+
+
+def selftest():


All of these should be in the Tests directory.

Right. Done.

wiredfool · 2018-01-24T11:39:31Z

src/PIL/pdfParser.py

+        # XXX TODO delete Pages tree recursively
+
+    def read_pdf_info_from_file(self, f):
+        self.buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


I think this is going to fail if a io.BytesIO object is passed in as the file like object.

Good point. I'm not sure how relevant that is given that the entire idea is to append to existing files, but I suppose that doesn't preclude having the file in memory for some reason. Fixed now.
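
A minimal sketch of the failure mode and one possible fallback, not the fix that was actually committed: io.BytesIO has no file descriptor, so fileno() raises io.UnsupportedOperation and mmap cannot be used. The helper name load_pdf_bytes is hypothetical.

```python
import io
import mmap


def load_pdf_bytes(f):
    """Return a bytes-like object holding the contents of f (hypothetical helper)."""
    try:
        # Real files expose an OS-level descriptor that mmap can map read-only.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    except (AttributeError, io.UnsupportedOperation):
        # In-memory streams such as io.BytesIO have no descriptor,
        # so fall back to reading the whole stream.
        return f.read()


in_memory = load_pdf_bytes(io.BytesIO(b"%PDF-1.4\n"))
```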

wiredfool · 2018-01-24T12:08:28Z

src/PIL/pdfParser.py

+            self.read_pdf_info_from_file(f)
+        elif filename is not None:
+            with open(filename, "rb") as f:
+                self.read_pdf_info_from_file(f)


I'm a little uncomfortable here with self.buf as local state and not something explicitly passed into read_pdf_info, especially as there is at least one other value that may be set here, but not in the else case below.

I'm not sure I understand. Do you mean to eliminate self.buf completely and just pass buf around in all the calls to the methods that read and parse the PDF?

Yes. In general, I prefer to see pure functions. If that's not feasible, then a set of object attributes that stay consistent over the life of the object. In this case, there's a function that (silently) requires self.buf, and self.buf is either null or not, depending on side effects in other functions.

I have now changed the interface so that the parser keeps the file open and keeps a reference to it, making an open-read-append use case much easier. I have also eliminated all the passing of the file and buffer objects.
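
A rough sketch of the shape described above, assuming a much simplified parser: the object owns the file for its whole life, buf is always set in the constructor, and a context manager closes the file. The class name ParserSketch and its method has_pdf_header are hypothetical and are not the PdfParser API.

```python
import io


class ParserSketch:
    """Toy stand-in showing consistent attributes plus a context manager."""

    def __init__(self, f):
        # Both attributes are set unconditionally, so no method ever
        # depends on a side effect of some other call.
        self.f = f
        self.buf = f.read()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions

    def close(self):
        self.f.close()

    def has_pdf_header(self):
        return self.buf.startswith(b"%PDF-")


stream = io.BytesIO(b"%PDF-1.4\n")
with ParserSketch(stream) as parser:
    header_ok = parser.has_pdf_header()
```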

codecov · 2018-01-24T23:20:43Z

Codecov Report

Merging #2965 into master will increase coverage by 2.59%.
The diff coverage is 95.92%.

@@            Coverage Diff             @@
##           master    #2965      +/-   ##
==========================================
+ Coverage   81.07%   83.66%   +2.59%     
==========================================
  Files         167      168       +1     
  Lines       22605    23509     +904     
  Branches     2793     2793              
==========================================
+ Hits        18326    19669    +1343     
+ Misses       4279     3840     -439

Impacted Files	Coverage Δ
src/PIL/Image.py	`91.11% <100%> (+0.53%)`	⬆️
src/PIL/pdfParser.py	`95.87% <95.87%> (ø)`
src/PIL/PdfImagePlugin.py	`94.73% <96.15%> (-1.54%)`	⬇️
src/PIL/ImageTk.py	`75% <0%> (+0.78%)`	⬆️
src/PIL/ImageCms.py	`85.58% <0%> (+1.74%)`	⬆️
src/PIL/ImageFile.py	`91.23% <0%> (+2.11%)`	⬆️
src/PIL/ImageGrab.py	`18.18% <0%> (+2.27%)`	⬆️
src/PIL/IcnsImagePlugin.py	`79.78% <0%> (+4.78%)`	⬆️
... and 7 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b9ea737...bc01bb6.

vashek · 2018-01-29T11:42:01Z

As far as I can tell, the alleged coverage decrease reported by Coveralls seems to be caused by coverage reporting apparently not working on the Python 3.7 build. In fact, coverage should have increased with this patch.

hugovk · 2018-01-29T12:05:00Z

I wouldn't worry too much about Coveralls, there's also an issue about it here: #2934. We can look at Codecov instead.

vashek · 2018-01-29T15:31:05Z

So... merge please? :)

hugovk · 2018-01-29T15:35:29Z

Tests/test_file_pdf.py

+            pdf.info.Keywords = "qw)e\\r(ty"
+            pdf.info.Creator = "hopper()"
+            pdf.start_writing()
+            pdf.write_xref_and_trailer(f)


Undefined name 'f', this test would fail if it was run

hugovk · 2018-01-29T15:36:36Z

Tests/test_file_pdf.py

+            self.assertEqual(pdf.info.Keywords, "qw)e\\r(ty")
+            self.assertEqual(pdf.info.Subject, u"ghi\uABCD")
+
+    def test_pdf_append(self):


Redefinition of test_pdf_append(), so only one of these is run
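
To illustrate why only one of them runs, a minimal sketch with a made-up class: a second def with the same name in a class body rebinds the name, so the first method is silently discarded.

```python
class DuplicateDemo:
    def test_pdf_append(self):
        return "first definition"

    # Same name again: this rebinds test_pdf_append in the class namespace,
    # so the definition above is no longer reachable.
    def test_pdf_append(self):
        return "second definition"


result = DuplicateDemo().test_pdf_append()
```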

hugovk · 2018-01-29T15:38:32Z

Tests/test_file_pdf.py

+    def test_pdf_open(self):
+        # fail on a buffer full of null bytes
+        self.assertRaises(pdfParser.PdfFormatError, pdfParser.PdfParser, buf=bytearray(65536))
+        # make an empty PDF object


Maybe add a newline before these comments to space things out a bit

hugovk · 2018-01-29T15:42:03Z

Tests/test_file_pdf.py

+    def test_pdf_append(self):
+        # make a PDF file
+        pdf_filename = self.helper_save_as_pdf("RGB", producer="pdfParser")
+        # open it, check pages and info


Newlines before comments to help readability

hugovk · 2018-01-29T15:44:34Z

Tests/test_pdfparser.py

@@ -0,0 +1,89 @@
+from helper import unittest, PillowTestCase
+
+from PIL.pdfParser import *


Let's replace this star import with the specific imports needed

hugovk · 2018-01-29T16:20:45Z

src/PIL/pdfParser.py

+                nesting_depth -= 1
+            offset = m.end()
+        raise PdfFormatError("unfinished literal string")
+


Remove one newline

hugovk · 2018-01-29T16:20:55Z

src/PIL/pdfParser.py

+            if m.group(1):
+                result.extend(klass.escaped_chars[m.group(1)[1]])
+            elif m.group(2):
+                #result.append(eval(m.group(1)))


Can this commented line be removed?

hugovk · 2018-01-29T16:21:09Z

src/PIL/pdfParser.py

+    re_xref_section_start = re.compile(whitespace_optional + br"xref" + newline)
+    re_xref_subsection_start = re.compile(whitespace_optional + br"([0-9]+)" + whitespace_mandatory + br"([0-9]+)" + whitespace_optional + newline_only)
+    re_xref_entry = re.compile(br"([0-9]{10}) ([0-9]{5}) ([fn])( \r| \n|\r\n)")
+    def read_xref_table(self, xref_section_offset):


Newline before this

hugovk · 2018-01-29T16:21:24Z

src/PIL/pdfParser.py

+                    self.xref_table[i] = new_entry
+        return offset
+
+


Remove a newline

hugovk · 2018-01-29T16:21:32Z

src/PIL/pdfParser.py

+        assert generation == ref[1]
+        return self.get_value(self.buf, offset + self.start_offset, expect_indirect=IndirectReference(*ref), max_nesting=max_nesting)[0]
+
+


Remove a newline

vashek · 2018-01-30T23:35:21Z

Thanks for the comprehensive review! Everything should be done now.

vashek · 2018-02-27T10:17:15Z

Is there anything I can do to help/encourage this forward? Thanks for your time.

wiredfool · 2018-02-27T12:13:51Z

Implement a 29 hour day?

I'm backed up with paid stuff now. I've been hoping to get to PR review, but it just hasn't happened this month.

wiredfool

LGTM once the asserts are converted to exceptions

wiredfool · 2018-03-03T13:04:39Z

src/PIL/PdfImagePlugin.py

    for imSequence in ims:
        for im in ImageSequence.Iterator(imSequence):
+            # FIXME: Should replace ASCIIHexDecode with RunLengthDecode (packbits)
+            # or LZWDecode (tiff/lzw compression).  Note that PDF 1.2 also supports


Just a note that I don't trust any of the pillow internal packbits or LZW compression methods to be correct, as they appear to lead to corrupt images in some of the tiff tests. We've been moving off of them in favor of libtiff.

FlateDecode should be OK, as that's a library function already.

Understood, thanks for the heads up. This is just moving around code that was already there, though, so I don't feel like doing anything about it in this pull request.

Fair enough.
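
For context on the filters named in the FIXME, a rough sketch using only the Python standard library (this is not code from PdfImagePlugin): ASCIIHexDecode data is hex digits ended by ">", which roughly doubles the size, while FlateDecode data is a zlib stream. The sample bytes are made up.

```python
import binascii
import zlib

raw = b"\x00\x10\x20\x30" * 64  # made-up sample "image" bytes

# ASCIIHexDecode: two hex digits per byte, terminated by ">".
hex_encoded = binascii.hexlify(raw).upper() + b">"

# FlateDecode: the stream is zlib/deflate data.
flate_encoded = zlib.compress(raw)

# Both encodings round-trip back to the original bytes.
hex_decoded = binascii.unhexlify(hex_encoded[:-1])
flate_decoded = zlib.decompress(flate_encoded)
```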

wiredfool · 2018-03-03T13:15:34Z

src/PIL/PdfParser.py

+            offset = m.end()
+        m = klass.re_indirect_def_start.match(data, offset)
+        if m:
+            assert int(m.group(1)) > 0


Please raise an exception rather than assert.

Done (and elsewhere).
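
A minimal sketch of the conversion being asked for, under the assumption of a small checking helper. assert statements are skipped entirely when Python runs with -O, so malformed input would pass unnoticed, whereas an explicit raise always fires. The PdfFormatError class below is a local stand-in, and require_format and parse_object_id are hypothetical names, not the functions in PdfParser.py.

```python
class PdfFormatError(RuntimeError):
    """Local stand-in for the parser's format error."""


def require_format(condition, message):
    # Unlike `assert`, this check is not removed under `python -O`.
    if not condition:
        raise PdfFormatError(message)


def parse_object_id(text):
    object_id = int(text)
    require_format(object_id > 0, "object ID must be greater than 0")
    return object_id


valid_id = parse_object_id("7")
try:
    parse_object_id("0")
    rejected = False
except PdfFormatError:
    rejected = True
```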

vashek · 2018-03-03T20:23:20Z

Just discovered a bug in writing a Page dict's Parent. Gimme a few minutes...

vashek · 2018-03-03T23:23:42Z

OK, that was a bit more than a few minutes but all is done now. Thanks for merging!

radarhere · 2018-03-08T08:44:18Z

Tests/test_file_pdf.py

+            self.assertEqual(pdf.info.Title, "abc")
+            self.check_pdf_pages_consistency(pdf)
+
+        # append two images


This comment is incorrect, yes? Only one image is being appended.

Actually, no, it is appending the mode_CMYK and mode_P images, and asserting that len(pdf.pages) is 3 a few lines below (where a few lines above it was 1). Am I missing something?

Merge please? ;-)

Ah, yes, I see. Thanks.

hugovk · 2018-03-13T07:03:39Z

docs/handbook/image-file-formats.rst

+
+**creator**
+    If the document was converted to PDF from another format, the name of the
+    conforming product that created the original document from which it was


What does "conforming" mean here?

This is a direct quote from the spec, see http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
And here's another direct quote, chapter 2.4 Conforming products:
A conforming product shall comply with all requirements regarding the creation of PDF files as specified in ISO 32000-1 as well as comply with all requirements regarding reader functional behavior specified in ISO 32000-1.

hugovk · 2018-03-13T09:30:30Z

Thanks!

issue python-pillow#2959: support appending to existing PDFs

6207b44

radarhere changed the title ~~issue #2959: support appending to existing PDFs~~ Issue #2959: support appending to existing PDFs Jan 19, 2018

vashek added 4 commits January 22, 2018 16:39

issue python-pillow#2959: fix Python 3.4 not supporting bytes%tuple

65112ba

issue python-pillow#2959: pdfParser selftest

ba211ff

issue python-pillow#2959: add tests and fixes, text encoding, remove remnants of text writing from PdfImagePlugin

a187a36

issue python-pillow#2959: fix test for nonexistent PDF file

cfacf8b

wiredfool reviewed Jan 24, 2018

vashek added 2 commits January 24, 2018 22:45

issue python-pillow#2959: change Py3 detection, fix trailer location for some PDFs

991f832

issue python-pillow#2959: text string decoding, support for Info dict, updated tests

13fe1a5

vashek added 11 commits January 25, 2018 00:22

issue python-pillow#2959: update documentation

bc01bb6

issue python-pillow#2959: support io.BytesIO objects

84f8747

issue python-pillow#2959: move pdfParser self tests to Tests directory

95f5c8d

issue python-pillow#2959: fix broken test

f956687

issue python-pillow#2959: improve Info setting and dumping

4d3b13f

issue python-pillow#2959: fix PdfDict attribute access, text decoding, tests

53ce9ec

issue python-pillow#2959: oops, hopefully fix Python 2.x

51bed10

issue python-pillow#2959: argh, do we really need to support Python 2.x? ;-)

524addc

issue python-pillow#2959: another Py2 bugfix

971837c

issue python-pillow#2959: support streams, add some tests

78fe32a

issue python-pillow#2959: keep file open, add context manager, add methods to support writing, eliminate the passing of file or buffer

ede57b9

hugovk reviewed Jan 29, 2018

issue python-pillow#2959: changes based on @hugovk's review

9be8d66

vashek added 2 commits January 31, 2018 00:35

issue python-pillow#2959: rename pdfParser.py to PdfParser.py

c15a0b2

issue python-pillow#2959: oops. sorry. reverting accidental change that broke builds

4cea610

wiredfool approved these changes Mar 3, 2018

issue python-pillow#2959: change asserts into raises

113d672

vashek added 2 commits March 3, 2018 23:32

issue python-pillow#2959: fix wrong Parent of pre-existing Page objects when appending

24ecfe3

issue python-pillow#2959: enhance test, mainly to trigger Appveyor re-build

928bea3

radarhere reviewed Mar 8, 2018

hugovk reviewed Mar 13, 2018

hugovk merged commit ddc9e73 into python-pillow:master Mar 13, 2018

radarhere mentioned this pull request Dec 28, 2019

Lazy loading images when creating a pdf #4067

Closed

radarhere mentioned this pull request Apr 20, 2023

Use later value for duplicate xref entries in PdfParser #7102

Merged
