Skip to content

Conversation

@vashek
Copy link
Contributor

@vashek vashek commented Jan 18, 2018

Fixes #2959.

Changes proposed in this pull request:

  • add pdfParser
  • rework PdfImagePlugin to use pdfParser
  • add appending support to PdfImagePlugin (with kw arg append=True)
  • fix the case when multiple images to be written to a PDF have different modes

@radarhere radarhere changed the title issue #2959: support appending to existing PDFs Issue #2959: support appending to existing PDFs Jan 19, 2018
UserDict = collections.UserDict


if sys.version_info.major >= 3:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usual check in this codebase is if str == bytes:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

return pages


def selftest():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these should be in the Tests directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Done.

# XXX TODO delete Pages tree recursively

def read_pdf_info_from_file(self, f):
self.buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is going to fail if a io.BytesIO object is passed in as the file like object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'm not sure how relevant that is given that the entire idea is to append to existing files, but I suppose that doesn't preclude having the file in memory for some reason. Fixed now.

self.read_pdf_info_from_file(f)
elif filename is not None:
with open(filename, "rb") as f:
self.read_pdf_info_from_file(f)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little uncomfortable here with self.buf as local state and not something explicitly passed into read_pdf_info, especially as there is at least one other value that may be set here, but not in the else case below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand. Do you mean to eliminate self.buf completely and just pass buf around in all the calls to the methods that read and parse the PDF?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. In general, I prefer to see pure functions. If that's not feasable, a set of object attributes that are consistent over the life of the object. In this case, there's a function that (silently) requires self.buf, and self.buf is either null or not, depending on side effects in other functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now changed the interface so that the parser keeps the file open and keeps a reference to it, making an open-read-append use case much easier. I have also eliminated all the passing of the file and buffer objects.

@codecov
Copy link

codecov bot commented Jan 24, 2018

Codecov Report

Merging #2965 into master will increase coverage by 2.59%.
The diff coverage is 95.92%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2965      +/-   ##
==========================================
+ Coverage   81.07%   83.66%   +2.59%     
==========================================
  Files         167      168       +1     
  Lines       22605    23509     +904     
  Branches     2793     2793              
==========================================
+ Hits        18326    19669    +1343     
+ Misses       4279     3840     -439
Impacted Files Coverage Δ
src/PIL/Image.py 91.11% <100%> (+0.53%) ⬆️
src/PIL/pdfParser.py 95.87% <95.87%> (ø)
src/PIL/PdfImagePlugin.py 94.73% <96.15%> (-1.54%) ⬇️
src/PIL/ImageTk.py 75% <0%> (+0.78%) ⬆️
src/PIL/ImageCms.py 85.58% <0%> (+1.74%) ⬆️
src/PIL/ImageFile.py 91.23% <0%> (+2.11%) ⬆️
src/PIL/ImageGrab.py 18.18% <0%> (+2.27%) ⬆️
src/PIL/IcnsImagePlugin.py 79.78% <0%> (+4.78%) ⬆️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b9ea737...bc01bb6. Read the comment docs.

@vashek
Copy link
Contributor Author

vashek commented Jan 29, 2018

As far as I can tell, the alleged coverage decrease reported by Coveralls seems to be caused by coverage reporting apparently not working on the Python 3.7 build. In fact, coverage should have increased with this patch.

@hugovk
Copy link
Member

hugovk commented Jan 29, 2018

I wouldn't worry too much about Coveralls, there's also an issue about it here: #2934. We can look at Codecov instead.

@vashek
Copy link
Contributor Author

vashek commented Jan 29, 2018

So... merge please? :)

pdf.info.Keywords = "qw)e\\r(ty"
pdf.info.Creator = "hopper()"
pdf.start_writing()
pdf.write_xref_and_trailer(f)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Undefined name 'f', this test would fail if it was run

self.assertEqual(pdf.info.Keywords, "qw)e\\r(ty")
self.assertEqual(pdf.info.Subject, u"ghi\uABCD")

def test_pdf_append(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redefinition of test_pdf_append(), so only one of these is run

def test_pdf_open(self):
# fail on a buffer full of null bytes
self.assertRaises(pdfParser.PdfFormatError, pdfParser.PdfParser, buf=bytearray(65536))
# make an empty PDF object
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a newline before these comments to space things out a bit

def test_pdf_append(self):
# make a PDF file
pdf_filename = self.helper_save_as_pdf("RGB", producer="pdfParser")
# open it, check pages and info
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newlines before comments to help readability

@@ -0,0 +1,89 @@
from helper import unittest, PillowTestCase

from PIL.pdfParser import *
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's replace this star import with the specific imports needed

nesting_depth -= 1
offset = m.end()
raise PdfFormatError("unfinished literal string")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove one newline

if m.group(1):
result.extend(klass.escaped_chars[m.group(1)[1]])
elif m.group(2):
#result.append(eval(m.group(1)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this commented line be removed?

re_xref_section_start = re.compile(whitespace_optional + br"xref" + newline)
re_xref_subsection_start = re.compile(whitespace_optional + br"([0-9]+)" + whitespace_mandatory + br"([0-9]+)" + whitespace_optional + newline_only)
re_xref_entry = re.compile(br"([0-9]{10}) ([0-9]{5}) ([fn])( \r| \n|\r\n)")
def read_xref_table(self, xref_section_offset):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newline before this

self.xref_table[i] = new_entry
return offset


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove a newline

assert generation == ref[1]
return self.get_value(self.buf, offset + self.start_offset, expect_indirect=IndirectReference(*ref), max_nesting=max_nesting)[0]


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove a newline

@vashek
Copy link
Contributor Author

vashek commented Jan 30, 2018

Thanks for the comprehensive review! Everything should be done now.

@vashek
Copy link
Contributor Author

vashek commented Feb 27, 2018

Is there anything I can do to help/encourage this forward? Thanks for your time.

@wiredfool
Copy link
Member

Implement a 29 hour day?

I'm backed up with paid stuff now. I've been hoping to get to PR review, but it just hasn't happened this month.

Copy link
Member

@wiredfool wiredfool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once the asserts are converted to exceptions

for imSequence in ims:
for im in ImageSequence.Iterator(imSequence):
# FIXME: Should replace ASCIIHexDecode with RunLengthDecode (packbits)
# or LZWDecode (tiff/lzw compression). Note that PDF 1.2 also supports
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note that I don't trust any of the pillow internal packbits or LZW compression methods to be correct, as they appear to lead to corrupt images in some of the tiff tests. We've been moving off of them in favor of libtiff.

Flatedecode should be ok, as that's a a library function already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, thanks for the heads up. This is just moving around code that was already there, though, so I don't feel like doing anything about it in this pull request.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough.

offset = m.end()
m = klass.re_indirect_def_start.match(data, offset)
if m:
assert int(m.group(1)) > 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please raise an exception rather than assert.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (and elsewhere).

@vashek
Copy link
Contributor Author

vashek commented Mar 3, 2018

Just discovered a bug in writing a Page dict's Parent. Gimme a few minutes...

@vashek
Copy link
Contributor Author

vashek commented Mar 3, 2018

OK, that was a bit more than a few minutes but all is done now. Thanks for merging!

self.assertEqual(pdf.info.Title, "abc")
self.check_pdf_pages_consistency(pdf)

# append two images
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is incorrect, yes? Only one image is being appended.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, no, it is appending the mode_CMYK and mode_P images, and asserting that len(pdf.pages) is 3 a few lines below (where a few lines above it was 1). Am I missing something?

Merge please? ;-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, I see. Thanks.


**creator**
If the document was converted to PDF from another format, the name of the
conforming product that created the original document from which it was
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "conforming" mean here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a direct quote from the spec, see http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
And here's another direct quote, chapter 2.4 Conforming products:
A conforming product shall comply with all requirements regarding the creation of PDF files as specified in ISO 32000-1 as well as comply with all requirements regarding reader functional behavior specified in ISO 32000-1.

@hugovk hugovk merged commit ddc9e73 into python-pillow:master Mar 13, 2018
@hugovk
Copy link
Member

hugovk commented Mar 13, 2018

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support for appending to existing PDF / TIFF files

4 participants