Random space added inside words #3359

miro-igov · 2025-07-04T12:08:44Z

I am extracting text from a PDF file with pypdf and the result looks buggy: the extracted text contains spaces inside words that do not exist in the source PDF.
Example:
PDF contains phrase - Postchirurgie (strabisme, ptérygion)
Extracted text contains - P ostchirurgie (strabism e ,ptérygion)

Why are these extra spaces inside the words?

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.12.25-0-virt-x86_64-with

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.7.0, crypt_provider=('cryptography', '44.0.2'), PIL=11.1.0

Code + PDF

This is a minimal, complete example that shows the issue:

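from pypdf import PdfReader

filepath = '/tmp/test.pdf'

# Open the file in a context manager so it is closed even if extraction fails.
with open(filepath, 'rb') as pdf_file:
    reader = PdfReader(pdf_file)
    # Concatenate the extracted text of every page.
    out_txt = ''.join(page.extract_text() for page in reader.pages)

print(out_txt)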

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

test.pdf

stefan6419846 (Maintainer) · 2025-07-04T13:35:31Z

Thanks for your report. As mentioned in our docs, text extraction is hard. You might want to have a look at the different parameters and try different variants there.

If you want to further analyze this, feel free to do so. We appreciate PRs which improve the general accuracy of the extraction if there is actual room for improvement.
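
For anyone who wants to try the different variants, here is a minimal sketch of how the outputs could be compared side by side. It assumes a recent pypdf where `extract_text()` accepts `extraction_mode="layout"` and a `space_width` fallback value. The helper names (`extract_variants`, `compare_first_page`) and the value `400.0` are made up for illustration and are not recommended settings, and none of these variants is guaranteed to remove the extra spaces for a given PDF.

```python
from typing import Any, Dict

# Keyword-argument sets to try with page.extract_text().
# The concrete values are illustrative guesses, not recommended settings.
VARIANTS: Dict[str, Dict[str, Any]] = {
    "plain (default)": {},
    "plain, space_width=400": {"space_width": 400.0},
    "layout": {"extraction_mode": "layout"},
}


def extract_variants(
    page: Any, variants: Dict[str, Dict[str, Any]] = VARIANTS
) -> Dict[str, str]:
    """Call page.extract_text() once per variant and collect the results."""
    results = {}
    for label, kwargs in variants.items():
        results[label] = page.extract_text(**kwargs)
    return results


def compare_first_page(path: str) -> None:
    """Print the text of the first page for every variant (hypothetical helper)."""
    # Imported here so extract_variants() can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    for label, text in extract_variants(reader.pages[0]).items():
        print(f"--- {label} ---")
        print(text)
```

Calling `compare_first_page('/tmp/test.pdf')` prints one block per variant, which makes it easier to see whether any of them handles a phrase such as "Postchirurgie (strabisme, ptérygion)" better than the default.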
