OCR-enabled PDF text extraction built on pypdf and Azure Document Intelligence
pypdftotext is a Python package that intelligently extracts text from PDF files. It uses pypdf's advanced layout mode for embedded text extraction and seamlessly falls back to Azure Document Intelligence OCR when no embedded text is found.
- Fast embedded text extraction using pypdf's layout mode
- Automatic OCR fallback via Azure Document Intelligence when needed
- Thread-safe operations with the PdfExtract class
- S3 support for reading PDFs directly from AWS S3
- Image compression to reduce PDF file sizes
- Handwritten text detection with confidence scoring
- Page manipulation: create child PDFs and extract page subsets
- Flexible configuration with built-in env support and multiple inheritance options
pip install pypdftotext
# Install with boto3 for S3 support
pip install "pypdftotext[s3]"
# Install with pillow for scanned pdf compression support
pip install "pypdftotext[image]"
# For all optional features (s3 and pillow)
pip install "pypdftotext[full]"
# For development (full + boto3-types[s3], pytest, pytest-cov)
pip install "pypdftotext[dev]"
- Python 3.10, 3.11, or 3.12
- pypdf 6.0
- azure-ai-documentintelligence >= 1.0.0
- tqdm (for progress bars)
- boto3 (optional)
- pillow (optional)
NOTE: If OCR has not been configured, only the text embedded directly in the PDF will be returned (using pypdf's layout mode).
- An Azure Subscription (create one for free)
- An Azure Document Intelligence resource (create one)
NOTE: The same behaviors apply to the AWS_* settings for pulling PDFs from S3.
export AZURE_DOCINTEL_ENDPOINT="https://your-resource.cognitiveservices.azure.com/"
export AZURE_DOCINTEL_SUBSCRIPTION_KEY="your-subscription-key"
from pypdftotext import constants
constants.AZURE_DOCINTEL_ENDPOINT = "https://your-resource.cognitiveservices.azure.com/"
constants.AZURE_DOCINTEL_SUBSCRIPTION_KEY = "your-subscription-key"
You can also set these values on individual instances of the PyPdfToTextConfig class, which are exposed by the config attribute of the PdfExtract and AzureDocIntelIntegrator classes. See below.
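Since an unconfigured OCR setup only yields embedded text, it can be useful to check the environment before processing scanned documents. The following is a minimal sketch using only the standard library; `missing_ocr_env_vars` is a hypothetical helper, not part of pypdftotext, and it inspects the environment only (values set via constants or on an instance's config are not visible to it).

```python
import os

# Variable names taken from the export examples above.
REQUIRED_OCR_VARS = (
    "AZURE_DOCINTEL_ENDPOINT",
    "AZURE_DOCINTEL_SUBSCRIPTION_KEY",
)


def missing_ocr_env_vars(environ=None):
    """Return the names of required OCR variables that are unset or empty.

    Hypothetical helper for illustration; pypdftotext does not provide it.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_OCR_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_ocr_env_vars()
    if missing:
        print("OCR env vars not set:", ", ".join(missing))
    else:
        print("OCR env vars are set.")
```

An empty list means both variables are present in the environment; it does not verify that the endpoint or key are valid.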
from pypdftotext import PdfExtract
extract = PdfExtract("document.pdf")
NOTE: if you've set env vars or constants, setting the endpoint and subscription key is optional. However, it is still acceptable to set them (and any other config options) on the instance itself after creating it.
extract.config.AZURE_DOCINTEL_ENDPOINT = "https://your-resource.cognitiveservices.azure.com/"
extract.config.AZURE_DOCINTEL_SUBSCRIPTION_KEY = "your-subscription-key"
extract.config.PRESERVE_VERTICAL_WHITESPACE = True
text = extract.text
print(text)
# Get text by page
for i, page_text in enumerate(extract.text_pages):
    print(f"Page {i + 1}: {page_text[:100]}...")
NOTE: Requires the optional pypdftotext[image] installation.
NOTE: Perform this step before accessing text/text_pages to use the compressed PDF for OCR. Otherwise, text will already be extracted from the original version and will not be re-extracted.
extract.compress_images(  # always converts images to greyscale
    white_point=200,  # pixels with values from 201 to 255 are set to 255 (white) to remove scanner artifacts
    aspect_tolerance=0.01,  # resizes images whose aspect ratios (width/height) are within 0.01 of the page aspect ratio
    max_overscale=1.5,  # images wider than 1.5x the displayed width of the PDF page are downsampled to 1.5x
)
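To make the white_point and max_overscale parameters more concrete, here is a rough sketch of the arithmetic they describe, written in plain Python on a list of 8-bit greyscale values (where 255 is white). The helper names `apply_white_point` and `capped_width` are hypothetical and this is not the library's implementation; it also assumes both widths are expressed in the same unit.

```python
def apply_white_point(pixels, white_point=200):
    """Set every greyscale value above white_point to 255 (white)."""
    return [255 if value > white_point else value for value in pixels]


def capped_width(image_width, displayed_width, max_overscale=1.5):
    """Return the image width after capping it at max_overscale x displayed width."""
    limit = round(displayed_width * max_overscale)
    return min(image_width, limit)


print(apply_white_point([0, 200, 201, 255]))  # [0, 200, 255, 255]
print(capped_width(3000, 1000))  # 1500
print(capped_width(1200, 1000))  # 1200
```

Values at or below the white point are left untouched, so dark text survives while light scanner noise is flattened to white.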
NOTE: If a scanned PDF contains upside down or rotated pages, these pages will be reoriented automatically during text extraction.
from pathlib import Path
Path("compressed_corrected_document.pdf").write_bytes(extract.body)
# create a new PdfExtract instance containing the first 10 pages of the original PDF.
extract_child = extract.child((0, 9)) # useful for passing config and metadata forward.
# get the bytes of a PDF containing pages 1, 3, and 5 without creating a new PdfExtract instance.
clipped_pages_pdf_bytes = extract_child.clip_pages([0, 2, 4]) # useful for quick splitting.
If an S3 URI (e.g. s3://my-bucket/path/to/document.pdf) is supplied as the pdf parameter, PdfExtract will attempt to pull the bytes from the supplied bucket/key. AWS credentials with proper permissions must be supplied as env vars or set programmatically, as described for Azure OCR above; otherwise an error will result.
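For readers unfamiliar with the bucket/key split, the sketch below shows one way an S3 URI can be broken into its two parts using the standard library. `split_s3_uri` is a hypothetical helper for illustration; how pypdftotext parses the URI internally is not documented here.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"Not a valid S3 URI: {uri!r}")
    return parsed.netloc, key


bucket, key = split_s3_uri("s3://my-bucket/path/to/document.pdf")
print(bucket)  # my-bucket
print(key)     # path/to/document.pdf
```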
OCR is automatically triggered when:
- The ratio of low-text pages exceeds TRIGGER_OCR_PAGE_RATIO (default: 99% of pages)
- A page is considered "low-text" if it has ≤ MIN_LINES_OCR_TRIGGER lines (default: 1)
Example: trigger OCR only when more than 50% of pages have 5 or fewer lines:
config = PyPdfToTextConfig(
    MIN_LINES_OCR_TRIGGER=5,
    TRIGGER_OCR_PAGE_RATIO=0.5,
)
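The trigger rule described above can be sketched in a few lines of plain Python. This is an illustration of the stated rule, not the library's code: the function names are hypothetical, and it assumes that only non-blank lines are counted and that "exceeds" means a strict greater-than comparison.

```python
def is_low_text(page_text, min_lines=1):
    """A page is low-text if it has at most min_lines non-blank lines."""
    lines = [line for line in page_text.splitlines() if line.strip()]
    return len(lines) <= min_lines


def should_trigger_ocr(pages, min_lines=1, page_ratio=0.99):
    """Return True when the share of low-text pages exceeds page_ratio."""
    if not pages:
        return False
    low_text_pages = sum(is_low_text(page, min_lines) for page in pages)
    return low_text_pages / len(pages) > page_ratio


scanned = ["", ""]                    # no embedded text on any page
mixed = ["", "line 1\nline 2\nline 3"]  # half of the pages are low-text
print(should_trigger_ocr(scanned))  # True
print(should_trigger_ocr(mixed))    # False
```

With the defaults, OCR effectively runs only when every page is low-text; lowering the ratio or raising the line threshold makes the fallback more eager.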
The PyPdfToTextConfig and PyPdfToTextConfigOverrides (optional) classes can be used to customize the operation of individual PdfExtract instances if desired.
- New PyPdfToTextConfig instances first reinitialize all relevant settings from the env and then inherit any settings that have been set programmatically via constants. This allows users to set API keys (via env OR constants) and other desired behaviors (via constants only) globally, eliminating the need to supply the config parameter to every PdfExtract instance.
- Inheritance from the global constants can be disabled globally by setting constants.INHERIT_CONSTANTS to False, or for a single PyPdfToTextConfig instance using the overrides parameter (e.g. PyPdfToTextConfig(overrides={"INHERIT_CONSTANTS": False})). The PyPdfToTextConfigOverrides TypedDict is available for IDE and typing support.
- An alternate base can be supplied to the PyPdfToTextConfig constructor. If supplied, its values supersede those in the global constants.
- If both a base and overrides are supplied, overlapping settings in overrides will supersede those in base (or constants).
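The precedence rules above can be summarized as env, then constants, then base, then overrides, with later layers winning. The sketch below models that ordering with plain dictionaries; `resolve_settings` is a hypothetical function written for illustration and makes no claim about how PyPdfToTextConfig is implemented internally.

```python
def resolve_settings(env, constants, base=None, overrides=None):
    """Merge setting layers so that later layers supersede earlier ones.

    Order: env -> constants (unless inheritance is disabled) -> base -> overrides.
    All arguments are plain dicts standing in for the real sources.
    """
    overrides = overrides or {}
    # Inheritance can be switched off globally (constants) or per instance (overrides).
    inherit = overrides.get(
        "INHERIT_CONSTANTS", constants.get("INHERIT_CONSTANTS", True)
    )
    settings = dict(env)
    if inherit:
        settings.update(constants)
    if base:
        settings.update(base)
    settings.update(overrides)
    return settings


env = {"AZURE_DOCINTEL_ENDPOINT": "env-endpoint"}
constants = {"MIN_LINES_OCR_TRIGGER": 3, "TRIGGER_OCR_PAGE_RATIO": 0.9}
base = {"MIN_LINES_OCR_TRIGGER": 5}
overrides = {"TRIGGER_OCR_PAGE_RATIO": 0.5}

resolved = resolve_settings(env, constants, base, overrides)
print(resolved["MIN_LINES_OCR_TRIGGER"])   # 5, from base
print(resolved["TRIGGER_OCR_PAGE_RATIO"])  # 0.5, from overrides
```

Passing overrides={"INHERIT_CONSTANTS": False} to this sketch skips the constants layer entirely, mirroring the per-instance opt-out described in the list.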
This project is licensed under the MIT License - see the LICENSE file for details.
Built on top of:
- pypdf for PDF parsing
- Azure Document Intelligence for OCR capabilities