ENH: Add all font metrics for base 14 Type 1 PDF fonts. #3363

PJBrs · 2025-07-07T18:01:25Z

This patch includes font metrics for the standard 14 fonts. This is intended to be useful for generating a text appearance stream, especially if you want to take into account right-aligned or centred text. (I have some other patches that include this as well as text wrapping.)

Note that some of this information was already included in _font_widths.py, but that information is incomplete. I thought it better to copy this information from pdfminer.six and be able to potentially benefit from their work later on, than to improve on what already was included here.

The first three patches introduce the new functionality. The last three patches are for moving the Font class to the new font metrics information and removing the old _font_widths.py file.

This is what the spec has to say about it:

9.6.2.2 Standard Type 1 fonts (standard 14 fonts) (PDF 1.0-1.7)

The PostScript language names of 14 Type 1 fonts, known as the standard 14 fonts, are as follows:
Times-Roman, Helvetica, Courier, Symbol, Times-Bold, Helvetica-Bold, Courier-Bold, ZapfDingbats,
Times-Italic, Helvetica-Oblique, Courier-Oblique, Times-BoldItalic, Helvetica-BoldOblique, Courier-BoldOblique.

In PDF 1.0 to PDF 1.7, the FirstChar, LastChar, Widths and FontDescriptor (see "Table 109: Entries in
a Type 1 font dictionary") were optional in Type 1 font dictionaries for the standard 14 fonts. PDF
processors supporting PDF 1.0 to PDF 1.7 files shall have these fonts, or their font metrics and suitable
substitution fonts, available.

These fonts, or their font metrics and suitable substitution fonts, shall be available to the PDF processor.

PJBrs · 2025-07-07T18:16:43Z

[about failing tests]
EDIT
I tested on the wrong branch and will investigate.

EDIT 2
Switching over the Font class to the new font metrics causes space changes in the extractor output. Otherwise the output is the same. I don't know this code, but I still assume that this is for the better. I can also drop the last three patches. But I'd like to hear from a reviewer to see what a good way forward would be.

The essentials in my view: having more complete font metrics available so that we can properly generate right-aligned and centered output.
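A minimal sketch of why the widths matter for alignment, assuming widths are stored per character in 1/1000 text-space units as in the AFM files. The names `WIDTHS`, `DEFAULT_WIDTH`, `text_width` and `start_x` are hypothetical and not part of pypdf; the table holds only a few sample entries.

```python
from typing import Dict

# Hypothetical excerpt of advance widths in 1/1000 text-space units.
# The numbers are sample data for this sketch, not the full Helvetica table.
WIDTHS: Dict[str, int] = {"H": 722, "i": 222, " ": 278, "!": 278}
DEFAULT_WIDTH = 500  # assumed fallback for characters missing from the table


def text_width(text: str, font_size: float, widths: Dict[str, int] = WIDTHS) -> float:
    """Return the rendered width of `text` in points."""
    units = sum(widths.get(char, DEFAULT_WIDTH) for char in text)
    return units * font_size / 1000


def start_x(text: str, font_size: float, box_left: float, box_right: float, align: str) -> float:
    """Return the x coordinate at which the text has to start for `align`."""
    width = text_width(text, font_size)
    if align == "right":
        return box_right - width
    if align == "center":
        return box_left + (box_right - box_left - width) / 2
    return box_left  # left-aligned text starts at the box edge
```

Without per-character widths, the `right` and `center` branches cannot be computed, which is the gap this PR tries to close for the 14 core fonts.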

codecov · 2025-07-07T20:46:56Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.90%. Comparing base (bfe7178) to head (5be4ae8).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3363      +/-   ##
==========================================
+ Coverage   96.89%   96.90%   +0.01%     
==========================================
  Files          54       56       +2     
  Lines        9263     9293      +30     
  Branches     1695     1695              
==========================================
+ Hits         8975     9005      +30     
  Misses        172      172              
  Partials      116      116


stefan6419846 · 2025-07-08T15:07:46Z

pypdf/_codecs/base14_fontmetrics.py

+
+
+FONT_METRICS : Dict[str, Tuple[Dict[str, object], Dict[Any, float]]]  = {
+    "Courier": (


Instead of doing this complex typing and unclear indexing when using the data, I would suggest to use actual data containers for each font.

Example container:

@dataclass(frozen=True)
class Font:
    name: str
    family: str
    weight: str
    ascent: float
    descent: float
    cap_height: float
    x_height: float
    italic_angle: float
    flags: int
    bbox: Tuple[float, float, float, float]
    character_widths: Dict[str, int]

This is roughly what we have over here: https://github.com/py-pdf/pypdf/blob/main/pypdf/_text_extraction/_layout_mode/_font.py

Do I try to replace this? Or do I add a separate font.py in pypdf?

Also, this would probably come down to re-implementing the method that produces the font metrics to make it produce the above Font dataclass.

Do I try to replace this? Or do I add a separate font.py in pypdf?

Ideally, we avoid duplication if possible.

Also, this would probably come down to re-implementing the method that produces the font metrics to make it produce the above Font dataclass.

I see nothing which would argue against it, especially considering future maintainability.

As far as I can tell, for future purposes, it would be nice to extend the existing Font dataclass so that it includes:

name: str

family: str

weight: str

ascent: float

descent: float

cap_height: float

x_height: float

italic_angle: float

flags: int

bbox: Tuple[float, float, float, float]

Some of this might already be included in

char_map (dict): character map

font_dictionary (dict): font dictionary

To be honest, I have no idea what a dataclass really does, and I wouldn't know how to extend the existing class. So, I propose that I redo this PR with the FONT_METRICS as I've reproduced them locally, and with the script that I've adapted, but without the addition of a new Font dataclass, or changes to the existing one.

That might be nice, but I'll see what I can do myself as well. I've looked a bit further in the code and in the spec. The existing font class has width_map, which corresponds with the character widths in the FONT_METRICS file. It also includes font_dictionary, which, as far as I can tell, just is a pdf font dictionary as you would find it in the page resources of a PDF file.

According to the spec, a font dictionary must usually include a set of font descriptors:

Except for Type 0 fonts, Type 3 fonts in non-tagged PDF documents, and certain standard Type 1 fonts,
every font dictionary shall contain a subsidiary dictionary, the font descriptor, containing font-wide
metrics and other attributes of the font; see 9.8, "Font descriptors".

These font descriptors include basically all other information in the FONT_METRICS class. So, the point is that the existing class will usually include the font descriptors for a font, but not for the 14 core fonts. So, one would actually need a separate dataclass FontDescriptors that reads the first part of the FONT_METRICS, or a dataclass FontDictionary, that includes both the FontDescriptors and the character widths. That way, you could parse the FONT_METRICS using the FontDictionary dataclass and then use the resulting dictionary to instantiate the Font.

Still thinking :-)

I've been looking at pdfminer.six's PDFFont class, https://github.com/pdfminer/pdfminer.six/blob/51683b2528e2aa685dd8b9e61f6ccf9f76a59a62/pdfminer/pdffont.py#L869 , and the more I think about it, the current Font class used by pypdf for text extraction is actually a specific case of what could be a more generic font class.

I think I finally get it. The pypdf Font class has no specific attributes (?) as described in the font descriptor. In most cases, the font descriptor is a dict within a font dictionary, meaning that the information contained in that dictionary would already be available within the pypdf font class. This is not the case, though, for the 14 core fonts. Instead, the FONT_METRICS that I took from pdfminer.six actually are a fallback for when those metrics aren't available. The same holds for the font widths.

So, the most elegant solution would be to:

Move the Font dataclass to the top pypdf directory (one patch)

Add attributes for the information one would expect in a font descriptor (one patch)

Fill the font descriptor information from the font dictionary or from FONT_METRICS (one patch)

Parse widths from FONT_METRICS (if absent; one patch)

This way, the typing can all be done in the Font dataclass.
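The fallback step in the plan above could look roughly like the following sketch. It works on plain dicts rather than pypdf objects (so indirect objects are assumed to be resolved already), and `CORE_DESCRIPTORS` and `resolve_descriptor` are hypothetical names. The Helvetica values are the ones asserted in the test later in this thread.

```python
from typing import Any, Dict, Optional

# Hypothetical fallback table keyed by base font name; one entry for brevity.
CORE_DESCRIPTORS: Dict[str, Dict[str, Any]] = {
    "Helvetica": {"/Ascent": 718, "/Descent": -207},
}


def resolve_descriptor(font_dict: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Prefer the embedded /FontDescriptor, else fall back to the core table."""
    descriptor = font_dict.get("/FontDescriptor")
    if descriptor is not None:
        return dict(descriptor)
    # /BaseFont is a PDF name such as "/Helvetica"; drop the leading slash.
    base_font = str(font_dict.get("/BaseFont", "")).lstrip("/")
    return CORE_DESCRIPTORS.get(base_font)
```

A font that is neither embedded nor one of the core fonts yields `None` here; how pypdf should treat that case is not decided in this thread.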

I must note that, for the other patches I'm preparing (text alignment and text wrapping) I'll initially only need the font widths, but to begin mimicking how pdftk generates a text appearance stream would require the detailed information from the font metrics.

How does this plan sound?

I do not think that the Font class should be public API if you mean this. For the other cases, this is hard to evaluate from my side without seeing it implemented, thus I cannot provide a proper/reliable statement about this here.

So, things are beginning to make more and more sense. First, there actually is quite some code duplication between pypdf/_cmap.py and pypdf/_text_extraction/_layout_mode/_font.py. Second, in both cases the font metrics added in this patch would enable removing / simplifying some existing functionality and adding font descriptor information. For my own purposes, however, I first and foremost only need the font widths. So, I simplified this PR and limited it to the point that build_font_width_map in _cmap.py reads font widths from the new font metrics information.

Specifically, I removed the huge typing and now limited it to one cast in _cmap.py:
font_width_map = cast(dict[str, float], FONT_METRICS[font_name][1])

Some points for further improvement might be:

Removal of _default_fonts_space_width from _cmap.py, since that information is already present in the new font metrics, and it is incomplete as well

Turning the type of font_width_map within _cmap.py from Dict[Any, float] to Dict[str, int] (or maybe Dict[Union[str, int], int])

Removing _font_widths.py, which contains incomplete information and does not have license information (for this, I have patches, but I removed it from this PR for now)

Addition of a font descriptor dataclass to _cmap.py

I'd be happy to provide additional PRs for any of the above. For now however, I would really like to concentrate on this PR specifically, since it is a prerequisite for generating a text appearance stream with proper text wrapping and with proper right aligned or centered text.

stefan6419846 · 2025-07-13T09:43:11Z

Thanks for the further changes. There still are some further changes I would like to see:

Move the generator script code into pypdf itself instead of the resources. Otherwise, linting does not trigger for example.
To which extent does the generator script re-use code from pdfminer.six?
Why is there such a "strange" copyright detection logic instead of just reading the corresponding fields?
Although there might be changes in a later PR, I would still prefer to have a proper object-oriented approach for the new data. Past has shown that nothing is more permanent than a temporary solution.

PJBrs · 2025-07-13T11:03:16Z

Thanks for the further changes. There still are some further changes I would like to see:

Move the generator script code into pypdf itself instead of the resources. Otherwise, linting does not trigger for example.

No problem!

To which extent does the generator script re-use code from pdfminer.six?

I think about 20%? The resulting formatting is still very much like pdfminer's (although their original script did not produce unicode codes but ints). EDIT:
grep -F -x -f /usr/lib64/python3.9/site-packages/pdfminer/fontmetrics.py resources/get_core_fontmetrics.py (excluding lines with only hashes or quotation marks):

            f = line.strip().split(" ")
            if not f:
                continue
            k = f[0]
            if k == "FontName":
                fontname = f[1]
                props = {"FontName": fontname, "Flags": 0}
            elif k == "C":
            elif k in ("CapHeight", "XHeight", "ItalicAngle", "Ascender", "Descender"):
                k = {"Ascender": "Ascent", "Descender": "Descent"}.get(k, k)
                props[k] = float(f[1])
            elif k in ("FontName", "FamilyName", "Weight"):
                k = {"FamilyName": "FontFamily", "Weight": "FontWeight"}.get(k, k)
                props[k] = f[1]
            elif k == "IsFixedPitch":
                if f[1].lower() == "true":
                    props["Flags"] = 64
            elif k == "FontBBox":
                props[k] = tuple(map(float, f[1:5]))

Why is there such a "strange" copyright detection logic instead of just reading the corresponding fields?

This is how it is in the original AFM files:

and

So, the same copyright is in two locations, one after Comment and one after Notice. Only the second, however, includes mention of the trademark. And do notice the lack of a space between "Reserved." and "Helvetica". So, with all that, you indeed get a rather strange copyright detection logic!

Although there might be changes in a later PR, I would still prefer to have a proper object-oriented approach for the new data. Past has shown that nothing is more permanent than a temporary solution.

OK. Then I think I need more information about what that would entail. What I think it does:

Keep all information as it is now
For the widths information, this can be used as I propose using it now, in _cmap.build_font_width_map
Add a @dataclass font_descriptor to _cmap.py that can initialise itself using the information in core_fontmetrics.py
Ideally, such a dataclass could actually use either the Type 1 font information from core_fontmetrics, or an existing /FontDescriptor dict within an embedded /Font resource. But I would need an easy test solution for that case (ideally, a form that doesn't use one of the core fonts).

Is that correct?

If not, perhaps it might be easier to reach out via discord, where I use the same user name.

stefan6419846 · 2025-07-13T11:15:39Z

In the first step, it would be sufficient for me to just create a new dataclass with the properties outlined previously. Merging it with the existing class can always happen later, but I would like to see real structured data for now. We basically need to define the new dataclass and adapt the generator script to call the dataclass constructor accordingly.
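As a rough sketch of what "call the dataclass constructor" could mean for a generator script: the script parses an AFM file, then writes Python source that constructs the dataclass directly. `render_font_source` and `SAMPLE` are hypothetical names, and the sample values are placeholders rather than output of the real script.

```python
from dataclasses import dataclass, fields
from typing import Dict, Tuple


@dataclass(frozen=True)
class Font:
    name: str
    family: str
    weight: str
    ascent: float
    descent: float
    cap_height: float
    x_height: float
    italic_angle: float
    flags: int
    bbox: Tuple[float, float, float, float]
    character_widths: Dict[str, int]


def render_font_source(font: Font) -> str:
    """Render Python source code that re-creates `font` when executed."""
    lines = ["Font("]
    for field in fields(font):
        # repr() yields valid literals for str, float, int, tuple and dict.
        lines.append(f"    {field.name}={getattr(font, field.name)!r},")
    lines.append(")")
    return "\n".join(lines)


# Placeholder values for illustration only.
SAMPLE = Font(
    name="Helvetica", family="Helvetica", weight="Medium",
    ascent=718.0, descent=-207.0, cap_height=718.0, x_height=523.0,
    italic_angle=0.0, flags=0, bbox=(-166.0, -225.0, 1000.0, 931.0),
    character_widths={" ": 278, "A": 667},
)
```

The generated text can be written into a module next to the dataclass definition, so the data file contains only typed constructor calls instead of nested tuples and dicts.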

Regarding the pdfminer.six code and the copyrights, I will need to have another look at it in the next days.

PJBrs · 2025-07-13T13:12:09Z

I'm afraid that means I'm officially out of my depth, I think I just lack the prerequisite knowledge to understand what you intend the generator script to do. If possible, please contact me on Discord one of these days so that we can discuss.

And as always, thanks for your comments!

PJBrs · 2025-07-13T16:24:44Z

P.S., this is what Google Gemini thinks:

import re
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class Font:
    """
    Dataclass to store parsed information from a Type 1 font AFM file.
    """
    name: str
    family: str
    weight: str

    ascent: float
    descent: float
    cap_height: float
    x_height: float
    italic_angle: float
    flags: int
    bbox: Tuple[float, float, float, float]

    character_widths: Dict[str, int]

def parse_afm_file(afm_content: str) -> Font:
    """
    Parses the content of a Type 1 font AFM file and returns a Font dataclass instance.

    Args:
        afm_content: A string containing the full content of the AFM file.

    Returns:
        A Font dataclass instance populated with the parsed data.

    Raises:
        ValueError: If essential font properties are missing or malformed in the AFM content.
    """
    font_properties = {}
    character_widths = {}
    in_char_metrics_section = False

    lines = afm_content.splitlines()

    for line in lines:
        line = line.strip()
        if not line:
            continue

        # Check for start/end of character metrics section
        if line == "StartCharMetrics":
            in_char_metrics_section = True
            continue
        elif line == "EndCharMetrics":
            in_char_metrics_section = False
            continue

        if in_char_metrics_section:
            # Parse character metrics: C <char_code>; WX <width_x>; N <char_name>; ...
            match = re.match(r"C\s+\d+;\s+WX\s+([\d.]+);\s+N\s+([a-zA-Z0-9_.]+);", line)
            if match:
                width = int(float(match.group(1))) # Widths are typically integers in AFM
                char_name = match.group(2)
                character_widths[char_name] = width
        else:
            # Parse general font properties
            parts = line.split(' ', 1) # Split only on the first space
            if len(parts) == 2:
                key, value = parts[0], parts[1]
                font_properties[key] = value

    # Extract and convert properties, handling potential missing values
    try:
        name = font_properties.get("FontName", "Unknown")
        family = font_properties.get("FamilyName", "Unknown")
        weight = font_properties.get("Weight", "Unknown")

        ascent = float(font_properties.get("Ascender", 0.0))
        descent = float(font_properties.get("Descender", 0.0))
        cap_height = float(font_properties.get("CapHeight", 0.0))
        x_height = float(font_properties.get("XHeight", 0.0))
        italic_angle = float(font_properties.get("ItalicAngle", 0.0))

        # Calculate flags: bit 0 is set if IsFixedPitch is true
        is_fixed_pitch = font_properties.get("IsFixedPitch", "false").lower() == "true"
        flags = 1 if is_fixed_pitch else 0

        # Parse FontBBox
        bbox_str = font_properties.get("FontBBox", "0 0 0 0")
        bbox_values = tuple(map(float, bbox_str.split()))
        if len(bbox_values) != 4:
            raise ValueError(f"Malformed FontBBox: {bbox_str}")
        bbox = bbox_values

    except (KeyError, ValueError) as e:
        raise ValueError(f"Error parsing AFM file: Missing or malformed property - {e}")

    return Font(
        name=name,
        family=family,
        weight=weight,
        ascent=ascent,
        descent=descent,
        cap_height=cap_height,
        x_height=x_height,
        italic_angle=italic_angle,
        flags=flags,
        bbox=bbox,
        character_widths=character_widths
    )

stefan6419846 · 2025-07-14T14:31:56Z

I have just used your script and adapted it to show what I mean: https://gist.github.com/stefan6419846/3d368b26ee5260a7886657909f26ca15 The adobe_glyphs module imported there is a standalone copy of https://github.com/py-pdf/pypdf/blob/main/pypdf/_codecs/adobe_glyphs.py

This will generate the code to create instances of the dataclass shown in #3363 (comment) together with the copyrights. The only things that are currently missing are the dataclass definition itself and the necessary imports as well as the mapping in the footer, but this should be easy enough to add to the script.

By the way: We should probably keep the script outside of the original code, but run some linting and testing on it nevertheless to ensure it matches our standards and does not break for some reason.

PJBrs · 2025-07-15T10:43:33Z

@stefan6419846 OK, now I understand better! Basically, you want the script to immediately produce the specific Font instances, instead of ending up in an intermediate form. With regard to linting / typing, as far as I can tell, I can just run ruff and mypy locally where it can find the script...? At least, that seems to work here.

I understand also, that using this, I can do something like:

if font_name in FONT_METRICS:
    font = FONT_METRICS[font_name]

and then do stuff like:

total_width = sum(font.character_widths[char] for char in "This is a long sentence")
print(total_width)

One question; Why do you only add 255 widths? The AFMs contain about 314 widths.

Second; so, this really collects a lot of information in the Font class. In my local patches, I need the widths (and I expect also some other metrics) for text wrapping a text stream when flattening an annotation. With my original patches, I changed build_font_width_map to include the character widths that are now in the Font instances in FONT_METRICS. However, the new Font class would not include character widths that are available for embedded fonts. So, how would you proceed with this? I see two ways forward:

Change build_font_width_map and use the new Font instances, something like this:

+    else:
+        font_name = str(ft["/BaseFont"])[1:]
+        if font_name in FONT_METRICS:
+            font_width_map = cast(Dict[str, float], FONT_METRICS[font_name].character_widths)

Change the Font dataclass so that you can initialise it with a /Font resource dictionary, set attributes based on FONT_METRICS instances when available, and otherwise set character_widths using the existing build_font_width_map. This makes it possible, in the future, to also set the other metrics based on embedded font information, if available.

EDIT

Where would you add the Font class?
Where would you place the font generation script?

stefan6419846 · 2025-07-15T10:51:51Z

With regard to linting / typing, as far as I can tell, I can just run ruff and mypy locally where it can find the script...?

Yes, although for the repository and its CI/CD, we might need to extend the current configuration. I am open to help with this if desired/required.

One question; Why do you only add 255 widths? The AFMs contain about 314 widths.

I used a mix of the original pdfminer.six code and your code for writing my script. The limitation is from pdfminer.six. This seems to mostly eliminate the characters with the code -1. I have no hard opinion on this/do not know what is correct.
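To make the difference concrete, here is a small sketch of how character-metrics lines with code -1 (glyphs that have no slot in the font's default encoding) can be kept or dropped. The lines follow the AFM `C ... ; WX ... ; N ... ;` layout; the sample lines, the zeroed bounding boxes and the function names are illustrative assumptions, not taken from either script.

```python
from typing import Dict, Iterable, List, Tuple

# Illustrative character-metrics lines; bounding boxes are zeroed placeholders.
SAMPLE_LINES: List[str] = [
    "C 32 ; WX 278 ; N space ; B 0 0 0 0 ;",
    "C 65 ; WX 667 ; N A ; B 0 0 0 0 ;",
    "C -1 ; WX 556 ; N Euro ; B 0 0 0 0 ;",
]


def parse_char_metric(line: str) -> Tuple[int, int, str]:
    """Return (character code, width, glyph name) for one metrics line."""
    entries: Dict[str, List[str]] = {}
    for part in line.split(";"):
        tokens = part.split()
        if tokens:
            entries[tokens[0]] = tokens[1:]
    return int(entries["C"][0]), int(float(entries["WX"][0])), entries["N"][0]


def widths_by_glyph(lines: Iterable[str], keep_unencoded: bool) -> Dict[str, int]:
    """Collect widths by glyph name, optionally skipping code -1 entries."""
    widths: Dict[str, int] = {}
    for line in lines:
        code, width, name = parse_char_metric(line)
        if code == -1 and not keep_unencoded:
            continue
        widths[name] = width
    return widths
```

With `keep_unencoded=False` the Euro glyph disappears from the result, which would explain a table that is shorter than the number of metrics lines in the AFM file.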

However, the new Font class would not include character widths that are available for embedded fonts. So, how would you proceed with this?

I would split this into two separate topics. This PR should focus on the new container and retrieving the data from the AFM files to use them where appropriate. Possibly unifying this with handling embedded fonts could/should be a separate step in a PR afterwards. This way, we avoid too large PRs which simplifies reviewing the changes from my side.

PJBrs · 2025-07-17T12:49:11Z

@stefan6419846

I used your script and tried to improve it, and I added your font class.

To be fair, this PR more and more looks like you answering my bug report instead of me trying to contribute new code; please adjust copyright and attribution accordingly if and when you pull this.

I added the font class to pypdf/_font.py. I noticed that other files in this directory have a copyright notice and attribution. Please advise, and add your name if needed!

I just tested locally, and the following works:

            font_name = font_res["/BaseFont"]  # ["/Name"] often also exists, but is deprecated
            if font_name[1:] in FONT_METRICS:
                my_font = FONT_METRICS[font_name[1:]]
                print(sum(my_font.character_widths.get(char, 200) for char in "Such a long sentence, how long is€€€𒈙 it "))

I noticed that codecov would like a test. I may be able to cobble up something, and otherwise it will be in a month or so.

resources/afm_to_dataclass.py

pypdf/_font.py

pypdf/_writer.py

stefan6419846 · 2025-07-17T14:37:31Z

To be fair, this PR more and more looks like you answering my bug report instead of me trying to contribute new code; please adjust copyright and attribution accordingly if and when you pull this.

I am helping with getting the changes integrated. Without you doing the initial work, I would not have looked into this myself. Doing the changes to the parser has been a little side project, allowing me some insights into AFM files.

I noticed that codecov would like a test. I may be able to cobble up something, and otherwise it will be in a month or so.

Your initial PR integrated the new functionality into the existing code and adapted some tests. Wouldn't this be sufficient or does this have any side effects? Codecov is correctly complaining because the new code is never executed.

PJBrs · 2025-07-17T14:41:24Z

To be fair, this PR more and more looks like you answering my bug report instead of me trying to contribute new code; please adjust copyright and attribution accordingly if and when you pull this.

I am helping with getting the changes integrated. Without you doing the initial work, I would not have looked into this myself. Doing the changes to the parser has been a little side project, allowing me some insights into AFM files.

I noticed that codecov would like a test. I may be able to cobble up something, and otherwise it will be in a month or so.

Your initial PR integrated the new functionality into the existing code and adapted some tests. Wouldn't this be sufficient or does this have any side effects? Codecov is correctly complaining because the new code is never executed.

I wrote a very small test file now - tests/test_font.py:

"""Test font-related functionality."""


from pypdf._codecs.core_fontmetrics import FONT_METRICS


def test_font_metrics():
    font_name = "Helvetica"
    my_font = FONT_METRICS[font_name]
    assert my_font.family == "Helvetica"
    assert my_font.weight == "Medium"
    assert my_font.ascent == 718
    assert my_font.descent == -207

    test_string = "This is a long sentence. !@%%^€€€. çûįö¶´"
    charwidth = sum(my_font.character_widths[char] for char in test_string)
    assert charwidth == 19251

    font_name = "Courier-Bold"
    my_font = FONT_METRICS[font_name]
    assert my_font.italic_angle == 0
    assert my_font.flags == 64
    assert my_font.bbox == (-113.0, -250.0, 749.0, 801.0)

If you prefer, I'll add something to the writer test.

This patch adds a new Font dataclass. Its initial use, for now, is to act as a dataclass for the font metrics of the 14 Adobe core fonts. These fonts are usually not embedded in PDF documents, while PDF readers are expected to carry that information themselves.

stefan6419846 · 2025-07-17T14:43:11Z

I am okay with a simple test covering one of the basic entries and possibly one of the explicitly mapped ones as well.

PJBrs · 2025-07-17T14:47:55Z

I am okay with a simple test covering one of the basic entries and possibly one of the explicitly mapped ones as well.

Sounds like what I just pasted, right?

stefan6419846 · 2025-07-17T14:55:59Z

Yes.

This patch adds a new file with the font metrics for the 14 core Type 1 PDF fonts. The file was inspired by the pdfminer.six project, where a very similar one is called fontmetrics.py. The information itself is generated by a separate file added with this patch: resources/afm_to_dataclass.py. The PDF specification expects a PDF reader to include these font metrics.

Version 1.7 of the PDF reference lists various alternative names as accepted for the 14 core fonts, such as Arial for Helvetica and CourierNew for Courier. Add these alternative names to the font metrics.
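One way to picture the alternative-name handling is an alias table consulted before the metrics lookup. This is only a sketch: the patch described above adds the names to the metrics themselves, whereas `ALTERNATIVE_NAMES`, `CORE_WIDTHS` and `lookup_core_widths` are hypothetical, and only the two aliases named in the commit message are listed.

```python
from typing import Dict, Optional

# Hypothetical widths table with a single sample entry per font.
CORE_WIDTHS: Dict[str, Dict[str, int]] = {
    "Helvetica": {" ": 278},
    "Courier": {" ": 600},
}

# Only the two alternatives mentioned in the commit message.
ALTERNATIVE_NAMES: Dict[str, str] = {
    "Arial": "Helvetica",
    "CourierNew": "Courier",
}


def lookup_core_widths(base_font: str) -> Optional[Dict[str, int]]:
    """Return the widths for a core font or one of its accepted aliases."""
    name = base_font.lstrip("/")  # accept PDF names such as "/Arial"
    name = ALTERNATIVE_NAMES.get(name, name)
    return CORE_WIDTHS.get(name)
```

Adding the aliases as extra keys that point at the same metrics object gives the same lookup result without a second table; the alias dict merely keeps the canonical names visible.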

This patch adds a little test for the Font dataclass to make codecov happy!

PJBrs · 2025-07-17T21:09:14Z

I'm testing some ideas and I get a circular import, I think... Let's let this simmer until I have something that I know works.

PJBrs · 2025-07-18T12:04:40Z

So, I tried yesterday to use the new Font class but I ran into quite some trouble, which basically comes down to how to initialise a Font instance. The reason: I would like to initialise a Font using a font resource dictionary.

As it is, everything works quite well for the 14 core fonts. However, it is important to note that those 14 fonts are actually the exception to the rule, the rule being that, for an embedded font, all metrics and its encoding ought to be available as part of a PDF font resource dictionary. Why is this important? Because, in practice, given the above one would want to be able to initialise a Font instance using a font resource dictionary as present within the PDF file.

This leads to an issue of circular imports: the FONT_METRICS dict imports the Font class. However, once you would like to recognise one of these fonts based on a font resource dict at initialisation, the Font class itself would need to import FONT_METRICS. I don’t see how we can work around this. I tried, and I got:

ImportError: cannot import name 'FONT_METRICS' from partially initialized module 'pypdf._codecs.core_fontmetrics' (most likely due to a circular import) (/usr/lib64/python3.9/site-packages/pypdf/_codecs/core_fontmetrics.py)
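One common way around such a cycle, sketched here under the assumption that the class only needs the table at call time, is to pass the table into the constructor helper instead of importing it at module level. `FontMetrics`, `from_font_resource` and `CORE` are hypothetical names; in a real layout `CORE` would live in a second module that imports the class, never the other way round.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class FontMetrics:
    name: str
    character_widths: Dict[str, int]

    @classmethod
    def from_font_resource(
        cls,
        resource: Mapping[str, object],
        core_metrics: Mapping[str, "FontMetrics"],
    ) -> Optional["FontMetrics"]:
        """Look up a core font by /BaseFont in a table supplied by the caller."""
        base_font = str(resource.get("/BaseFont", "")).lstrip("/")
        return core_metrics.get(base_font)


# Stand-in for the generated table; one sample entry only.
CORE: Dict[str, FontMetrics] = {
    "Helvetica": FontMetrics("Helvetica", {" ": 278}),
}
```

Another option with the same effect is to move the `import` statement for the table into the method body, so that it runs only after both modules have finished loading.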

From a more integrated perspective on what a font is in PDF terms, the font metrics for the 14 core fonts are incomplete. As a font, they should be accompanied by an encoding and a character map, both of which are specified in the font resource dictionary. I would argue that, what in this PR is a Font class, in actuality is more a FontMetrics class.

I can think of two ways forward:

Turn the current Font class into a FontMetrics class, add that to _codecs/core_fontmetrics.py, and if, in the future, pypdf would want to include a full-fledged Font class, then the FontMetrics class would be a subclass in case the user hits a non-embedded instance of a core font.
Keep the current (beginnings of a) Font class, turn the current FONT_METRICS back into a Dict[Dict, Any] (or whatever it was, but not a class), and add an init method that takes a font resource dict as argument and uses the FONT_METRICS for the core fonts if those metrics are not available in the font resource dict itself.

I personally would prefer the second option; I see your point that the second option is less object oriented, but I do think that it is more robust for future changes. Nevertheless, please advise!

PJBrs · 2025-07-18T17:38:46Z

On second (final?) thought, it occurs to me that probably the first option is best.

A Font dataclass already exists in pypdf/_text_extraction/_layout_mode/_font.py and it seems that associated code is already used in _page. Perhaps it is best to just focus on the missing functionality...

PJBrs · 2025-07-18T20:23:10Z

I've given all this still more thought. So, the AFM file contains a set of widths and a set of font characteristics that we could call FontDescriptor. Existing pypdf code includes incomplete core font width information in two places:

pypdf/pypdf/_cmap.py

Line 107 in bfe7178

_default_fonts_space_width: Dict[str, int] = {

and:

pypdf/pypdf/_text_extraction/_layout_mode/_font_widths.py

Line 2 in bfe7178

STANDARD_WIDTHS = {

Furthermore code already exists in two places to get a set of font widths from a font resource dictionary, in pypdf/_cmap.py:

pypdf/pypdf/_cmap.py

Line 402 in bfe7178

def build_font_width_map(

and in pypdf/_text_extraction/_layout_mode/_font.py:

pypdf/pypdf/_text_extraction/_layout_mode/_font.py

Line 38 in bfe7178

def __post_init__(self) -> None:

In both cases, an incomplete fallback exists for the 14 core fonts that this PR should be able to solve.

Code dealing with FontDescriptor information is nowhere to be seen yet (as far as I can tell). Such information is being used in pdftk to generate appearance streams. For embedded fonts, this ought to be part of a font resource dictionary. Here's an example of an arbitrary PDF that I opened:
{'/Type': '/Font', '/Subtype': '/CIDFontType2', '/CIDSystemInfo': {'/Ordering': 'Identity', '/Registry': 'Adobe', '/Supplement': 0}, '/FontDescriptor': IndirectObject(29, 0, 139871477943648), '/BaseFont': '/RIPFSJ+Tahoma,Bold', '/W': [3, [292], 15, [312], 19, [636], 39, [757], 49, [770], 54, [633], 55, [612], 68, [598], 69, [631], 72, [593], 76, [301], 78, [602], 79, [301], 80, [953], 82, [617], 85, [433], 87, [415], 88, [640], 92, [575], 188, [636]]}

And this is the /FontDescriptor field:
{'/Type': '/FontDescriptor', '/Ascent': 1000, '/CapHeight': 727, '/Descent': -207, '/Flags': 32, '/FontBBox': [-698, -419, 2196, 1065], '/ItalicAngle': 0, '/StemV': 0, '/XHeight': 548, '/FontName': '/RIPFSJ+Tahoma,Bold', '/FontFile2': IndirectObject(30, 0, 139871477943648)}

Adding code that parses FontDescriptor information would be trivial. For instance, info like the above is parsed in at least three places already (as far as I can tell). It's just that the FontDescriptor information is not extracted.
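A minimal sketch of such parsing, assuming the descriptor has already been resolved to a plain mapping (no indirect objects). The `FontDescriptor` dataclass and its `from_dict` classmethod are hypothetical; the defaults for missing keys are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Tuple


@dataclass(frozen=True)
class FontDescriptor:
    name: str
    ascent: float
    descent: float
    cap_height: float
    x_height: float
    italic_angle: float
    flags: int
    bbox: Tuple[float, float, float, float]

    @classmethod
    def from_dict(cls, descriptor: Mapping[str, Any]) -> "FontDescriptor":
        """Build a FontDescriptor from a resolved /FontDescriptor mapping."""
        left, bottom, right, top = (
            float(value) for value in descriptor.get("/FontBBox", (0, 0, 0, 0))
        )
        return cls(
            name=str(descriptor.get("/FontName", "")).lstrip("/"),
            ascent=float(descriptor.get("/Ascent", 0)),
            descent=float(descriptor.get("/Descent", 0)),
            cap_height=float(descriptor.get("/CapHeight", 0)),
            x_height=float(descriptor.get("/XHeight", 0)),
            italic_angle=float(descriptor.get("/ItalicAngle", 0)),
            flags=int(descriptor.get("/Flags", 0)),
            bbox=(left, bottom, right, top),
        )
```

Fed with the /FontDescriptor shown above (minus the /FontFile2 reference), this yields a descriptor with `ascent == 1000.0`, `flags == 32` and the four-value bounding box as a tuple of floats.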

This leads me to the following conclusion:

The Font dataclass in this PR is too broad for the information that it adds
A FontMetrics dataclass would be better, but the widths part would not be so easy to access and no right place exists yet for the FontDescriptor information

I now think that the most elegant solution would be to:

Add a FontDescriptor dataclass
Adapt the afm_to_dataclass.py script so that it creates FONT_METRICS: Dict[str, Tuple[FontDescriptor, Dict[str, int]]]; in other words, a Dict with the font name as key and a tuple as value. Each tuple consists of a FontDescriptor and a Dict with font widths. This is very similar to the original pdfminer.six code, but with a dataclass for the FontDescriptor instead.
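As a rough illustration of the proposed shape, the sketch below builds a one-entry FONT_METRICS mapping. The FontDescriptor fields, the "ExampleMono" font and all numbers are placeholders for illustration only; they are not taken from real AFM files, and keying the widths by character is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class FontDescriptor:
    """Hypothetical, trimmed-down descriptor; real fields may differ."""

    name: str
    ascent: float
    descent: float


# Font name -> (descriptor, widths). Placeholder values, NOT real metrics.
FONT_METRICS: Dict[str, Tuple[FontDescriptor, Dict[str, int]]] = {
    "ExampleMono": (
        FontDescriptor(name="ExampleMono", ascent=700.0, descent=-200.0),
        {" ": 600, "A": 600, "B": 600},
    ),
}

# Callers unpack the tuple to get both parts in one lookup.
descriptor, widths = FONT_METRICS["ExampleMono"]
```

Keeping the widths as a plain Dict next to the dataclass means width lookups stay a single dictionary access, which is the access pattern the existing width-map code appears to rely on.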

This enables the following subsequent steps (not in this PR):

Correct the core font width information in _cmap.py and in _text_extraction/_layout_mode/_font.py; these are really trivial changes that then make it very easy to adapt the generate_appearance_stream method so that it can correctly wrap, scale, centre and right-align text. <-- This is the big aim that I'm after, and have most of the code for.
Create a build_font_descriptor_from_dict method (or maybe add a build_font_descriptor_from_dict classmethod to initialise a FontDescriptor, who knows!)
Move the _font.py file to pypdf/_font.py
Perhaps add FontDescriptor attribute to Font class
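The alignment aim named in the first step can be sketched as follows, assuming a widths Dict keyed by character. Glyph widths for these fonts are expressed in thousandths of the font size. The helper names text_width and x_offset are hypothetical and not part of pypdf, and the widths below are placeholders rather than real metrics.

```python
from typing import Dict


def text_width(text: str, widths: Dict[str, int], font_size: float, default: int = 0) -> float:
    """Return the width of text in user-space units.

    Glyph widths are given in 1/1000 of the font size, so the sum is scaled.
    Characters missing from widths count as `default` (an assumption).
    """
    return sum(widths.get(ch, default) for ch in text) * font_size / 1000.0


def x_offset(text: str, widths: Dict[str, int], font_size: float,
             box_width: float, align: str = "left") -> float:
    """Return the horizontal start position of text inside a box."""
    free = box_width - text_width(text, widths, font_size)
    if align == "right":
        return free
    if align == "center":
        return free / 2
    return 0.0


# Placeholder widths for illustration only.
example_widths = {"a": 500, "b": 500, " ": 250}
width = text_width("ab ab", example_widths, font_size=12)
right = x_offset("ab ab", example_widths, 12, box_width=100, align="right")
centre = x_offset("ab ab", example_widths, 12, box_width=100, align="center")
```

Text wrapping could reuse text_width by measuring candidate lines word by word, but that part is not shown here.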

@stefan6419846, please advise. It's like a menu :-) Would you like:

The PR as it currently is;
Turn the current Font class into a FontMetrics class and add that to _codecs/core_fontmetrics.py, as discussed above;
Keep the current (beginnings of a) Font class, turn the current FONT_METRICS back into a Dict[Dict, Any] (or whatever it was, but not a class), and add an init method that takes a font resource dict as argument (I now think this is a bad idea);
The above proposal (this I like best).

I'm going to redo the PR in about three weeks or so.

PJBrs changed the title ~~ENH / ROB: Add all font metrics for base 14 Type 1 PDF fonts.~~ ENH: Add all font metrics for base 14 Type 1 PDF fonts. Jul 7, 2025

stefan6419846 reviewed Jul 8, 2025

pypdf/_codecs/base14_fontmetrics.py (outdated)

stefan6419846 reviewed Jul 8, 2025

pypdf/_codecs/base14_fontmetrics.py (outdated)

stefan6419846 reviewed Jul 8, 2025

PJBrs marked this pull request as draft July 8, 2025 16:01

PJBrs force-pushed the fontwork branch 3 times, most recently from e2658b8 to 8cb825d Compare July 12, 2025 16:16

PJBrs marked this pull request as ready for review July 12, 2025 16:25

PJBrs marked this pull request as draft July 13, 2025 13:12

MAINT: _writer: Fix docstring for _merge_content_stream_to_page

5e6c41d

PJBrs force-pushed the fontwork branch from 8cb825d to f8f9738 Compare July 17, 2025 12:17

stefan6419846 reviewed Jul 17, 2025

resources/afm_to_dataclass.py (outdated)

stefan6419846 reviewed Jul 17, 2025

resources/afm_to_dataclass.py (outdated)

stefan6419846 reviewed Jul 17, 2025

pypdf/_font.py (outdated)

stefan6419846 reviewed Jul 17, 2025

pypdf/_writer.py

ENH: Font dataclass

2061f82

This patch adds a new Font dataclass. Its initial use, for now, is to act as a dataclass for the font metrics of the 14 Adobe core fonts. These fonts are usually not embedded in PDF documents, while PDF readers are expected to carry that information themselves.

PJBrs force-pushed the fontwork branch from f8f9738 to 8a89528 Compare July 17, 2025 15:05

PJBrs added 3 commits July 17, 2025 17:25

ROB: font_metrics: Add font aliases

eaccd6e

Version 1.7 of the PDF reference lists various alternative names as accepted for the 14 core fonts, such as Arial for Helvetica and CourierNew for Courier. Add these alternative names to the font metrics.
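A minimal sketch of how such aliases could be resolved before a metrics lookup is shown below. Only the two aliases named in the commit message are included; the CORE_FONT_ALIASES table and the resolve_core_font helper are hypothetical names and do not claim to mirror what the patch actually does.

```python
from typing import Dict

# Alias -> core font name. Only the two pairs mentioned in the commit
# message are listed; the full table in the patch is not reproduced here.
CORE_FONT_ALIASES: Dict[str, str] = {
    "Arial": "Helvetica",
    "CourierNew": "Courier",
}


def resolve_core_font(base_font: str) -> str:
    """Map a /BaseFont value to a core font name, if it is a known alias."""
    name = base_font.lstrip("/")  # PDF names carry a leading slash
    return CORE_FONT_ALIASES.get(name, name)
```

An alternative would be to store the aliases as extra keys in the metrics table itself, which avoids the second lookup at the cost of duplicated entries.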

ENH: tests: Add a little test for the Font dataclass

cbde019

This patch adds a little test for the Font dataclass to make codecov happy!

PJBrs force-pushed the fontwork branch from 8a89528 to cbde019 Compare July 17, 2025 15:27

PJBrs marked this pull request as ready for review July 17, 2025 15:35

Merge branch 'main' into fontwork

5be4ae8

PJBrs marked this pull request as draft July 17, 2025 21:08



		FONT_METRICS : Dict[str, Tuple[Dict[str, object], Dict[Any, float]]] = {
		"Courier": (
