ENH: Add all font metrics for base 14 Type 1 PDF fonts. #3363

Draft: wants to merge 6 commits into main

PJBrs
Contributor

@PJBrs PJBrs commented Jul 7, 2025

This patch includes font metrics for the standard 14 fonts. This is intended to be useful for generating a text appearance stream, especially if you want to take into account right-aligned or centred text. (I have some other patches that include this as well as text wrapping.)
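
To make the alignment use case concrete, here is a minimal sketch of how glyph widths feed into right-aligned or centred placement. The helper names `text_width` and `x_offset`, the fallback width of 500 and the `SAMPLE_WIDTHS` table are assumptions for illustration only and are not part of this PR; widths are taken to be in 1/1000 text-space units, as in AFM files.

```python
from typing import Dict

# Illustrative glyph widths in 1/1000 text-space units (assumed sample values).
SAMPLE_WIDTHS: Dict[str, int] = {"H": 722, "e": 556, "l": 222, "o": 556, " ": 278}


def text_width(text: str, widths: Dict[str, int], font_size: float, default: int = 500) -> float:
    """Return the width of ``text`` in points when set at ``font_size``."""
    units = sum(widths.get(char, default) for char in text)
    return units * font_size / 1000


def x_offset(
    text: str,
    widths: Dict[str, int],
    font_size: float,
    box_width: float,
    align: str = "left",
) -> float:
    """Return the x position where ``text`` starts inside a box of ``box_width`` points."""
    width = text_width(text, widths, font_size)
    if align == "right":
        return box_width - width
    if align == "center":
        return (box_width - width) / 2
    return 0.0
```

Without complete width tables the sum in `text_width` silently falls back to the default for missing glyphs, which is why incomplete metrics lead to visibly misplaced text.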

Note that some of this information was already included in _font_widths.py, but that information is incomplete. I thought it better to copy this information from pdfminer.six and be able to potentially benefit from their work later on, than to improve on what already was included here.

The first three patches introduce the new functionality. The last three patches are for moving the Font class to the new font metrics information and removing the old _font_widths.py file.

This is what the spec has to say about it:

9.6.2.2 Standard Type 1 fonts (standard 14 fonts) (PDF 1.0-1.7)

The PostScript language names of 14 Type 1 fonts, known as the standard 14 fonts, are as follows:
Times-Roman, Helvetica, Courier, Symbol, Times-Bold, Helvetica-Bold, Courier-Bold, ZapfDingbats,
Times-Italic, Helvetica-Oblique, Courier-Oblique, Times-BoldItalic, Helvetica-BoldOblique, Courier-BoldOblique.

In PDF 1.0 to PDF 1.7, the FirstChar, LastChar, Widths and FontDescriptor (see "Table 109: Entries in
a Type 1 font dictionary") were optional in Type 1 font dictionaries for the standard 14 fonts. PDF
processors supporting PDF 1.0 to PDF 1.7 files shall have these fonts, or their font metrics and suitable
substitution fonts, available.

These fonts, or their font metrics and suitable substitution fonts, shall be available to the PDF processor.

@PJBrs PJBrs changed the title ENH / ROB: Add all font metrics for base 14 Type 1 PDF fonts. ENH: Add all font metrics for base 14 Type 1 PDF fonts. Jul 7, 2025
@PJBrs
Contributor Author

PJBrs commented Jul 7, 2025

[about failing tests]
EDIT
I tested on the wrong branch and will investigate.

EDIT 2
Switching over the Font class to the new font metrics causes space changes in the extractor output. Otherwise the output is the same. I don't know this code, but I still assume that this is for the better. I can also drop the last three patches. But I'd like to hear from a reviewer to see what a good way forward would be.

The essentials in my view: having more complete font metrics available so that we can properly generate right-aligned and centered output.


codecov bot commented Jul 7, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.90%. Comparing base (bfe7178) to head (5be4ae8).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3363      +/-   ##
==========================================
+ Coverage   96.89%   96.90%   +0.01%     
==========================================
  Files          54       56       +2     
  Lines        9263     9293      +30     
  Branches     1695     1695              
==========================================
+ Hits         8975     9005      +30     
  Misses        172      172              
  Partials      116      116              




FONT_METRICS : Dict[str, Tuple[Dict[str, object], Dict[Any, float]]] = {
"Courier": (
Collaborator

Instead of doing this complex typing and unclear indexing when using the data, I would suggest to use actual data containers for each font.

Example container:

@dataclass(frozen=True)
class Font:
    name: str
    family: str
    weight: str

    ascent: float
    descent: float
    cap_height: float
    x_height: float
    italic_angle: float
    flags: int
    bbox: Tuple[float, float, float, float]

    character_widths: Dict[str, int]

Contributor Author

This is roughly what we have over here: https://github.com/py-pdf/pypdf/blob/main/pypdf/_text_extraction/_layout_mode/_font.py

Do I try to replace this? Or do I add a separate font.py in pypdf?

Contributor Author

Also, this would probably come down to re-implementing the method that produces the font metrics to make it produce the above Font dataclass.

Collaborator

> Do I try to replace this? Or do I add a separate font.py in pypdf?

Ideally, we avoid duplication if possible.

> Also, this would probably come down to re-implementing the method that produces the font metrics to make it produce the above Font dataclass.

I see nothing which would argue against it, especially considering future maintainability.

Contributor Author

@PJBrs PJBrs Jul 9, 2025

As far as I can tell, for future purposes, it would be nice to extend the existing Font dataclass so that it includes:

  • name: str
  • family: str
  • weight: str
  • ascent: float
  • descent: float
  • cap_height: float
  • x_height: float
  • italic_angle: float
  • flags: int
  • bbox: Tuple[float, float, float, float]

Some of this might already be included in

  • char_map (dict): character map
  • font_dictionary (dict): font dictionary

To be honest, I have no idea what a dataclass really does, and I wouldn't know how to extend the existing class. So, I propose that I redo this PR with the FONT_METRICS as I've reproduced them locally, and with the script that I've adapted, but without the addition of a new Font dataclass, or changes to the existing one.
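
For readers in the same position, a minimal sketch of what `@dataclass(frozen=True)` provides. The class name `FontMetricsSketch` and its reduced field set are hypothetical; the ascent and descent values are the Helvetica figures that appear in the test later in this thread, and the space width is an illustrative sample.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from typing import Dict


@dataclass(frozen=True)
class FontMetricsSketch:  # hypothetical name, reduced field set
    name: str
    ascent: float
    descent: float
    character_widths: Dict[str, int] = field(default_factory=dict)


# The decorator generates __init__, so fields become constructor arguments.
helvetica = FontMetricsSketch(
    name="Helvetica", ascent=718.0, descent=-207.0, character_widths={" ": 278}
)

# It also generates __eq__, which compares instances field by field.
same = FontMetricsSketch("Helvetica", 718.0, -207.0, {" ": 278})

# frozen=True makes attribute assignment raise instead of mutating the instance.
try:
    helvetica.ascent = 0.0  # type: ignore[misc]
    frozen = False
except FrozenInstanceError:
    frozen = True
```

In short, a dataclass is an ordinary class whose boilerplate methods are written for you, so typed attribute access such as `helvetica.ascent` replaces untyped tuple or dictionary indexing.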

Contributor Author

That might be nice, but I'll see what I can do myself as well. I've looked a bit further in the code and in the spec. The existing font class has width_map, which corresponds with the character widths in the FONT_METRICS file. It also includes font_dictionary, which, as far as I can tell, just is a pdf font dictionary as you would find it in the page resources of a PDF file.

According to the spec, a font dictionary must usually include a set of font descriptors:

Except for Type 0 fonts, Type 3 fonts in non-tagged PDF documents, and certain standard Type 1 fonts,
every font dictionary shall contain a subsidiary dictionary, the font descriptor, containing font-wide
metrics and other attributes of the font; see 9.8, "Font descriptors".

These font descriptors include basically all other information in the FONT_METRICS class. So, the point is that the existing class will usually include the font descriptors for a font, but not for the 14 core fonts. So, one would actually need a separate dataclass FontDescriptors that reads the first part of the FONT_METRICS, or a dataclass FontDictionary, that includes both the FontDescriptors and the character widths. That way, you could parse the FONT_METRICS using the FontDictionary dataclass and then use the resulting dictionary to instantiate the Font.

Still thinking :-)

Contributor Author

I've been looking at pdfminer.six's PDFFont class, https://github.com/pdfminer/pdfminer.six/blob/51683b2528e2aa685dd8b9e61f6ccf9f76a59a62/pdfminer/pdffont.py#L869 , and the more I think about it, the current Font class used by pypdf for text extraction is actually a specific case of what could be a more generic font class.

Contributor Author

I think I finally get it. The pypdf Font class has no specific attributes (?) as described in the font descriptor. In most cases, the font descriptor is a dict within a font dictionary, meaning that the information contained in that dictionary would already be available within the pypdf font class. This is not the case, though, for the 14 core fonts. Instead, the FONT_METRICS that I took from pdfminer.six actually are a fallback for when those metrics aren't available. The same holds for the font widths.

So, the most elegant solution would be to:

  • Move the Font dataclass to the top pypdf directory (one patch)
  • Add attributes for the information one would expect in a font descriptor (one patch)
  • Fill the font descriptor information from the font dictionary or from FONT_METRICS (one patch)
  • Parse widths from FONT_METRICS (if absent; one patch)

This way, the typing can all be done in the Font dataclass.
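
The fallback step in this plan could look roughly like the sketch below. It works on plain dictionaries only, so resolving indirect objects as a real PDF library must do is left out; `resolve_descriptor` and `CORE_DESCRIPTORS` are hypothetical names, and the two Helvetica values are the ones asserted in the test further down this thread.

```python
from typing import Any, Dict, Optional

# Hypothetical stand-in for the core-font table built from the AFM files.
CORE_DESCRIPTORS: Dict[str, Dict[str, Any]] = {
    "Helvetica": {"/Ascent": 718.0, "/Descent": -207.0},
}


def resolve_descriptor(font_resource: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Prefer the descriptor stored in the PDF; fall back to core-font metrics."""
    descriptor = font_resource.get("/FontDescriptor")
    if descriptor is not None:
        return descriptor
    # /BaseFont is a PDF name such as "/Helvetica"; drop the leading slash.
    base_font = str(font_resource.get("/BaseFont", "")).lstrip("/")
    return CORE_DESCRIPTORS.get(base_font)
```

The ordering matters: data embedded in the file wins, and the bundled metrics are consulted only when the font dictionary omits the descriptor, which the spec permits for the standard 14 fonts.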

I must note that, for the other patches I'm preparing (text alignment and text wrapping) I'll initially only need the font widths, but to begin mimicking how pdftk generates a text appearance stream would require the detailed information from the font metrics.

How does this plan sound?

Collaborator

I do not think that the Font class should be public API if you mean this. For the other cases, this is hard to evaluate from my side without seeing it implemented, thus I cannot provide a proper/reliable statement about this here.

Contributor Author

So, things are beginning to make more and more sense. First, there actually is quite some code duplication between pypdf/_cmap.py and pypdf/_text_extraction/_layout_mode/_font.py. Second, in both cases the font metrics added in this patch would enable removing / simplifying some existing functionality and adding font descriptor information. For my own purposes, however, I first and foremost only need the font widths. So, I simplified this PR and limited it to the point that build_font_width_map in _cmap.py reads font widths from the new font metrics information.

Specifically, I removed the huge typing and now limited it to one cast in _cmap.py:
font_width_map = cast(dict[str, float], FONT_METRICS[font_name][1])

Some points for further improvement might be:

  • Removal of _default_fonts_space_width from _cmap.py, since that information is already present in the new font metrics, and it is incomplete as well
  • Turning the type of font_width_map within _cmap.py from Dict[Any, float] to Dict[str, int] (or maybe Dict[Union[str, int], int])
  • Removing _font_widths.py, which contains incomplete information and does not have license information (for this, I have patches, but I removed it from this PR for now)
  • Addition of a font descriptor dataclass to _cmap.py

I'd be happy to provide additional PRs for any of the above. For now however, I would really like to concentrate on this PR specifically, since it is a prerequisite for generating a text appearance stream with proper text wrapping and with proper right aligned or centered text.

@PJBrs PJBrs marked this pull request as draft July 8, 2025 16:01
@PJBrs PJBrs force-pushed the fontwork branch 3 times, most recently from e2658b8 to 8cb825d Compare July 12, 2025 16:16
@PJBrs PJBrs marked this pull request as ready for review July 12, 2025 16:25
@stefan6419846
Collaborator

Thanks for the further changes. There still are some further changes I would like to see:

  • Move the generator script code into pypdf itself instead of the resources. Otherwise, linting does not trigger for example.
  • To which extent does the generator script re-use code from pdfminer.six?
  • Why is there such a "strange" copyright detection logic instead of just reading the corresponding fields?
  • Although there might be changes in a later PR, I would still prefer to have a proper object-oriented approach for the new data. Past has shown that nothing is more permanent than a temporary solution.

@PJBrs
Contributor Author

PJBrs commented Jul 13, 2025

> Thanks for the further changes. There still are some further changes I would like to see:
>
>   • Move the generator script code into pypdf itself instead of the resources. Otherwise, linting does not trigger for example.

No problem!

>   • To which extent does the generator script re-use code from pdfminer.six?

I think about 20%? The resulting formatting is still very like pdfminer (although their original script did not produce unicode codes but ints). EDIT:
grep -F -x -f /usr/lib64/python3.9/site-packages/pdfminer/fontmetrics.py resources/get_core_fontmetrics.py (excluding lines with only hashes or quotation marks):

            f = line.strip().split(" ")
            if not f:
                continue
            k = f[0]
            if k == "FontName":
                fontname = f[1]
                props = {"FontName": fontname, "Flags": 0}
            elif k == "C":
            elif k in ("CapHeight", "XHeight", "ItalicAngle", "Ascender", "Descender"):
                k = {"Ascender": "Ascent", "Descender": "Descent"}.get(k, k)
                props[k] = float(f[1])
            elif k in ("FontName", "FamilyName", "Weight"):
                k = {"FamilyName": "FontFamily", "Weight": "FontWeight"}.get(k, k)
                props[k] = f[1]
            elif k == "IsFixedPitch":
                if f[1].lower() == "true":
                    props["Flags"] = 64
            elif k == "FontBBox":
                props[k] = tuple(map(float, f[1:5]))

>   • Why is there such a "strange" copyright detection logic instead of just reading the corresponding fields?

This is how it is in the original AFM files:

Comment Copyright (c) 1985, 1987, 1989, 1990, 1997 Adobe Systems Incorporated. All Rights Reserved.

and

Notice Copyright (c) 1985, 1987, 1989, 1990, 1997 Adobe Systems Incorporated. All Rights Reserved.Helvetica is a trademark of Linotype-Hell AG and/or its subsidiaries.

So, the same copyright is in two locations, one after Comment and one after Notice. Only the second, however, includes mention of the trademark. And do notice the lack of a space between "Reserved." and "Helvetica". So, with all that, you indeed get a rather strange copyright detection logic!
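
A minimal sketch of the kind of logic those two lines force on a parser: de-duplicate the copyright statement and split off the trademark sentence even though no space follows "Reserved.". The function name `collect_notices` is hypothetical and this is not the code from the PR's generator script.

```python
from typing import List

MARKER = "All Rights Reserved."


def collect_notices(afm_text: str) -> List[str]:
    """Collect unique copyright and trademark sentences from Comment/Notice lines."""
    notices: List[str] = []
    for line in afm_text.splitlines():
        keyword, _, value = line.partition(" ")
        if keyword not in ("Comment", "Notice") or not value.startswith("Copyright"):
            continue
        # Split right after the marker, so a trademark sentence glued to
        # "Reserved." without a space still becomes its own entry.
        head, marker, tail = value.partition(MARKER)
        for part in (head + marker, tail):
            part = part.strip()
            if part and part not in notices:
                notices.append(part)
    return notices
```

Applied to the two lines quoted above, this yields the copyright statement once, followed by the Helvetica trademark sentence as a separate entry.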

>   • Although there might be changes in a later PR, I would still prefer to have a proper object-oriented approach for the new data. Past has shown that nothing is more permanent than a temporary solution.

OK. Then I think I need more information about what that would entail. What I think it does:

  • Keep all information as it is now
  • For the widths information, this can be used as I propose using it now, in _cmap.build_font_width_map
  • Add a @dataclass font_descriptor to _cmap.py that can initialise itself using the information in core_fontmetrics.py
  • Ideally, such a dataclass could actually use either the Type 1 font information from core_fontmetrics, or an existing /FontDescriptor dict within an embedded /Font resource. But I would need an easy test solution for that case (ideally, a form that doesn't use one of the core fonts).

Is that correct?

If not, perhaps it might be easier to reach out via discord, where I use the same user name.

@stefan6419846
Collaborator

In the first step, it would be sufficient for me to just create a new dataclass with the properties outlined previously. Merging it with the existing class can always happen later, but I would like to see real structured data for now. We basically need to define the new dataclass and adapt the generator script to call the dataclass constructor accordingly.
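
One way to picture "adapt the generator script to call the dataclass constructor" is the sketch below, which relies on the fact that the repr generated for a dataclass is itself a valid constructor call. The reduced `Font` class and the `emit_module` helper are hypothetical; the script actually used in this PR may be structured quite differently.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Font:  # reduced, hypothetical subset of the fields proposed above
    name: str
    ascent: float
    character_widths: Dict[str, int]


def emit_module(fonts: Dict[str, Font]) -> str:
    """Return Python source text that rebuilds ``fonts`` via constructor calls."""
    lines = ["FONT_METRICS = {"]
    for key, font in fonts.items():
        # The dataclass-generated repr is already a valid constructor call.
        lines.append(f"    {key!r}: {font!r},")
    lines.append("}")
    return "\n".join(lines) + "\n"
```

The generated text, combined with the class definition and imports, gives a module in which every entry of `FONT_METRICS` is a typed instance rather than a tuple of dictionaries.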

Regarding the pdfminer.six code and the copyrights, I will need to have another look at it in the next days.

@PJBrs
Contributor Author

PJBrs commented Jul 13, 2025

I'm afraid that means I'm officially out of my depth, I think I just lack the prerequisite knowledge to understand what you intend the generator script to do. If possible, please contact me on Discord one of these days so that we can discuss.

And as always, thanks for your comments!

@PJBrs PJBrs marked this pull request as draft July 13, 2025 13:12
@PJBrs
Contributor Author

PJBrs commented Jul 13, 2025

P.S., this is what Google Gemini thinks:

import re
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class Font:
    """
    Dataclass to store parsed information from a Type 1 font AFM file.
    """
    name: str
    family: str
    weight: str

    ascent: float
    descent: float
    cap_height: float
    x_height: float
    italic_angle: float
    flags: int
    bbox: Tuple[float, float, float, float]

    character_widths: Dict[str, int]

def parse_afm_file(afm_content: str) -> Font:
    """
    Parses the content of a Type 1 font AFM file and returns a Font dataclass instance.

    Args:
        afm_content: A string containing the full content of the AFM file.

    Returns:
        A Font dataclass instance populated with the parsed data.

    Raises:
        ValueError: If essential font properties are missing or malformed in the AFM content.
    """
    font_properties = {}
    character_widths = {}
    in_char_metrics_section = False

    lines = afm_content.splitlines()

    for line in lines:
        line = line.strip()
        if not line:
            continue

        # Check for start/end of character metrics section
        if line == "StartCharMetrics":
            in_char_metrics_section = True
            continue
        elif line == "EndCharMetrics":
            in_char_metrics_section = False
            continue

        if in_char_metrics_section:
            # Parse character metrics: C <char_code> ; WX <width_x> ; N <char_name> ; ...
            # AFM files put spaces around the semicolons, and unencoded glyphs use code -1.
            match = re.match(r"C\s+-?\d+\s*;\s*WX\s+([\d.]+)\s*;\s*N\s+([a-zA-Z0-9_.]+)\s*;", line)
            if match:
                width = int(float(match.group(1))) # Widths are typically integers in AFM
                char_name = match.group(2)
                character_widths[char_name] = width
        else:
            # Parse general font properties
            parts = line.split(' ', 1) # Split only on the first space
            if len(parts) == 2:
                key, value = parts[0], parts[1]
                font_properties[key] = value

    # Extract and convert properties, handling potential missing values
    try:
        name = font_properties.get("FontName", "Unknown")
        family = font_properties.get("FamilyName", "Unknown")
        weight = font_properties.get("Weight", "Unknown")

        ascent = float(font_properties.get("Ascender", 0.0))
        descent = float(font_properties.get("Descender", 0.0))
        cap_height = float(font_properties.get("CapHeight", 0.0))
        x_height = float(font_properties.get("XHeight", 0.0))
        italic_angle = float(font_properties.get("ItalicAngle", 0.0))

        # Calculate flags: bit 0 is set if IsFixedPitch is true
        is_fixed_pitch = font_properties.get("IsFixedPitch", "false").lower() == "true"
        flags = 1 if is_fixed_pitch else 0

        # Parse FontBBox
        bbox_str = font_properties.get("FontBBox", "0 0 0 0")
        bbox_values = tuple(map(float, bbox_str.split()))
        if len(bbox_values) != 4:
            raise ValueError(f"Malformed FontBBox: {bbox_str}")
        bbox = bbox_values

    except (KeyError, ValueError) as e:
        raise ValueError(f"Error parsing AFM file: Missing or malformed property - {e}")

    return Font(
        name=name,
        family=family,
        weight=weight,
        ascent=ascent,
        descent=descent,
        cap_height=cap_height,
        x_height=x_height,
        italic_angle=italic_angle,
        flags=flags,
        bbox=bbox,
        character_widths=character_widths
    )

@stefan6419846
Collaborator

I have just used your script and adapted it to show what I mean: https://gist.github.com/stefan6419846/3d368b26ee5260a7886657909f26ca15 The adobe_glyphs module imported there is a standalone copy of https://github.com/py-pdf/pypdf/blob/main/pypdf/_codecs/adobe_glyphs.py

This will generate the code to create instances of the dataclass shown in #3363 (comment) together with the copyrights. The only things that are currently missing are the dataclass definition itself and the necessary imports as well as the mapping in the footer, but this should be easy enough to add to the script.

By the way: We should probably keep the script outside of the original code, but run some linting and testing on it nevertheless to ensure it matches our standards and does not break for some reason.

@PJBrs
Contributor Author

PJBrs commented Jul 15, 2025

@stefan6419846 OK, now I understand better! Basically, you want the script to immediately produce the specific Font instances, instead of ending up in an intermediate form. With regard to linting / typing, as far as I can tell, I can just run ruff and mypy locally where it can find the script...? At least, that seems to work here.

I understand also, that using this, I can do something like:

if font_name in FONT_METRICS:
    font = FONT_METRICS[font_name]

and then do stuff like:

total_width = sum(font.character_widths[char] for char in "This is a long sentence")
print(total_width)

One question; Why do you only add 255 widths? The AFMs contain about 314 widths.

Second; so, this really collects a lot of information in the Font class. In my local patches, I need the widths (and I expect also some other metrics) for text wrapping a text stream when flattening an annotation. With my original patches, I changed build_font_width_map to include the character widths that are now in the Font instances in FONT_METRICS. However, the new Font class would not include character widths that are available for embedded fonts. So, how would you proceed with this? I see two ways forward:

  1. Change build_font_width_map and use the new Font instances, something like this:
+    else:
+        font_name = str(ft["/BaseFont"])[1:]
+        if font_name in FONT_METRICS:
+            font_width_map = cast(Dict[str, float], FONT_METRICS[font_name].character_widths)

  2. Change the Font dataclass so that you can initialise it with a /Font resource dictionary, set attributes based on FONT_METRICS instances when available, and otherwise set character_widths using the existing build_font_width_map. This makes it possible, in the future, to also set the other metrics based on embedded font information, if available.

EDIT

  1. Where would you add the Font class?
  2. Where would you place the font generation script?

@stefan6419846
Collaborator

> With regard to linting / typing, as far as I can tell, I can just run ruff and mypy locally where it can find the script...?

Yes, although for the repository and its CI/CD, we might need to extend the current configuration. I am open to help with this if desired/required.

> One question; Why do you only add 255 widths? The AFMs contain about 314 widths.

I used a mix of the original pdfminer.six code and your code for writing my script. The limitation is from pdfminer.six. This seems to mostly eliminate the characters with the code -1. I have no hard opinion on this/do not know what is correct.
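
To illustrate the mechanism only: in an AFM file each character-metric line starts with a character code, and glyphs that have no slot in the font's built-in encoding carry the code -1. The sketch below separates the two groups; the helper name `split_widths` is hypothetical and the three sample lines are assumed, illustrative data in AFM syntax.

```python
import re
from typing import Dict, Tuple

CHAR_METRIC = re.compile(r"C\s+(-?\d+)\s*;\s*WX\s+(\d+)\s*;\s*N\s+(\S+)\s*;")

# Illustrative lines in AFM character-metric syntax (assumed sample data).
SAMPLE = """\
C 32 ; WX 278 ; N space ; B 0 0 0 0 ;
C 65 ; WX 667 ; N A ; B 14 0 654 718 ;
C -1 ; WX 556 ; N Euro ; B -7 -19 549 703 ;
"""


def split_widths(afm_lines: str) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (encoded, unencoded) glyph-name-to-width maps."""
    encoded: Dict[str, int] = {}
    unencoded: Dict[str, int] = {}
    for line in afm_lines.splitlines():
        match = CHAR_METRIC.match(line.strip())
        if not match:
            continue
        code, width, name = int(match.group(1)), int(match.group(2)), match.group(3)
        # Code -1 marks a glyph without a slot in the built-in encoding.
        target = encoded if code >= 0 else unencoded
        target[name] = width
    return encoded, unencoded
```

Dropping the second map is what shrinks the table; keeping it makes widths available for glyphs such as Euro that are reachable only by name.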

> However, the new Font class would not include character widths that are available for embedded fonts. So, how would you proceed with this?

I would split this into two separate topics. This PR should focus on the new container and retrieving the data from the AFM files to use them where appropriate. Possibly unifying this with handling embedded fonts could/should be a separate step in a PR afterwards. This way, we avoid too large PRs which simplifies reviewing the changes from my side.

@PJBrs
Contributor Author

PJBrs commented Jul 17, 2025

@stefan6419846

I used your script and tried to improve it, and I added your font class.

To be fair, this PR more and more looks like you answering my bug report instead of me trying to contribute new code; please adjust copyright and attribution accordingly if and when you pull this.

I added the font class to pypdf/_font.py. I noticed that other files in this directory have a copyright notice and attribution; please advise, and add your name if needed!

I just tested locally, and the following works:

            font_name = font_res["/BaseFont"]  # ["/Name"] often also exists, but is deprecated
            if font_name[1:] in FONT_METRICS:
                my_font = FONT_METRICS[font_name[1:]]
                # Characters without a known width fall back to 200 units.
                print(sum(my_font.character_widths.get(char, 200) for char in "Such a long sentence, how long is€€€𒈙 it "))

I noticed that codecov would like a test. I may be able to cobble up something, and otherwise it will be in a month or so.

@stefan6419846
Collaborator

> To be fair, this PR more and more looks like you answering my bug report instead of me trying to contribute new code; please adjust copyright and attribution accordingly if and when you pull this.

I am helping with getting the changes integrated. Without you doing the initial work, I would not have looked into this myself. Doing the changes to the parser has been a little side project, allowing me some insights into AFM files.

> I noticed that codecov would like a test. I may be able to cobble up something, and otherwise it will be in a month or so.

Your initial PR integrated the new functionality into the existing code and adapted some tests. Wouldn't this be sufficient or does this have any side effects? Codecov is correctly complaining because the new code is never executed.

@PJBrs
Contributor Author

PJBrs commented Jul 17, 2025

> > To be fair, this PR more and more looks like you answering my bug report instead of me trying to contribute new code; please adjust copyright and attribution accordingly if and when you pull this.
>
> I am helping with getting the changes integrated. Without you doing the initial work, I would not have looked into this myself. Doing the changes to the parser has been a little side project, allowing me some insights into AFM files.
>
> > I noticed that codecov would like a test. I may be able to cobble up something, and otherwise it will be in a month or so.
>
> Your initial PR integrated the new functionality into the existing code and adapted some tests. Wouldn't this be sufficient or does this have any side effects? Codecov is correctly complaining because the new code is never executed.

I wrote a very small test file now - tests/test_font.py:

"""Test font-related functionality."""


from pypdf._codecs.core_fontmetrics import FONT_METRICS


def test_font_metrics():
    font_name = "Helvetica"
    my_font = FONT_METRICS[font_name]
    assert my_font.family == "Helvetica"
    assert my_font.weight == "Medium"
    assert my_font.ascent == 718
    assert my_font.descent == -207

    test_string = "This is a long sentence. !@%%^€€€. çûįö¶´"
    charwidth = sum(my_font.character_widths[char] for char in test_string)
    assert charwidth == 19251

    font_name = "Courier-Bold"
    my_font = FONT_METRICS[font_name]
    assert my_font.italic_angle == 0
    assert my_font.flags == 64
    assert my_font.bbox == (-113.0, -250.0, 749.0, 801.0)

If you prefer, I'll add something to the writer test.

This patch adds a new Font dataclass. Its initial use, for now,
is to act as a dataclass for the font metrics of the 14 Adobe
core fonts. These fonts are usually not embedded in PDF documents,
while PDF readers are expected to carry that information themselves.
@stefan6419846
Collaborator

I am okay with a simple test covering one of the basic entries and possibly one of the explicitly mapped ones as well.

@PJBrs
Contributor Author

PJBrs commented Jul 17, 2025

> I am okay with a simple test covering one of the basic entries and possibly one of the explicitly mapped ones as well.

Sounds like what I just pasted, right?

@stefan6419846
Collaborator

Yes.

PJBrs added 3 commits July 17, 2025 17:25

This patch adds a new file with the font metrics
for the 14 core Type 1 pdf fonts. The file was
inspired by the pdfminer.six project, where a
very similar one is called fontmetrics.py.
The information itself is generated by a
separate file added with this patch:
resources/afm_to_dataclass.py

The PDF specification expects a pdf reader to
include these font metrics.

Version 1.7 of the PDF reference lists various alternative
names as accepted for the 14 core fonts, such as Arial for
Helvetica and CourierNew for Courier. Add these alternative
names to the font metrics.

This patch adds a little test for the Font dataclass
to make codecov happy!
@PJBrs PJBrs marked this pull request as ready for review July 17, 2025 15:35
@PJBrs PJBrs marked this pull request as draft July 17, 2025 21:08
@PJBrs
Contributor Author

PJBrs commented Jul 17, 2025

I'm testing some ideas and I get a circular import, I think... Let's let this simmer until I have something that I know works.

@PJBrs
Contributor Author

PJBrs commented Jul 18, 2025

So, I tried yesterday to use the new Font class but I ran in quite some trouble, which basically comes down to how to initialise a Font instance. The reason: I would like to initialise a Font using a font resource dictionary.

As it is, everything works quite well for the 14 core fonts. However, it is important to note that those 14 fonts are actually the exception to the rule, the rule being that, for an embedded font, all metrics and its encoding ought to be available as part of a PDF font resource dictionary. Why is this important? Because, in practice, given the above one would want to be able to initialise a Font instance using a font resource dictionary as present within the PDF file.

This leads to an issue of circular imports: the module that defines the FONT_METRICS dict imports the Font class. However, once you would like to recognise one of these fonts based on a font resource dict at initialisation, the Font class itself would need to import FONT_METRICS. I don’t see how we can work around this. I tried, and I got:

ImportError: cannot import name 'FONT_METRICS' from partially initialized module 'pypdf._codecs.core_fontmetrics' (most likely due to a circular import) (/usr/lib64/python3.9/site-packages/pypdf/_codecs/core_fontmetrics.py)
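
For context, a common Python workaround for this kind of cycle is to defer one of the two imports to call time. The sketch below demonstrates that with two throwaway modules written to a temporary directory; the module names `demo_font` and `demo_metrics`, the reduced `Font` class and the `from_resource` method are all hypothetical and are not what this PR implements.

```python
import importlib
import sys
import tempfile
from pathlib import Path

FONT_MODULE = '''
from dataclasses import dataclass


@dataclass(frozen=True)
class Font:
    name: str
    ascent: float

    @classmethod
    def from_resource(cls, resource):
        # Deferred import: runs at call time, after demo_font has finished loading.
        from demo_metrics import FONT_METRICS
        return FONT_METRICS[resource["/BaseFont"][1:]]
'''

METRICS_MODULE = '''
from demo_font import Font

FONT_METRICS = {"Helvetica": Font("Helvetica", 718.0)}
'''


def load_demo():
    """Write both demo modules to a temp directory and return the Font class."""
    directory = tempfile.mkdtemp()
    Path(directory, "demo_font.py").write_text(FONT_MODULE)
    Path(directory, "demo_metrics.py").write_text(METRICS_MODULE)
    sys.path.insert(0, directory)
    importlib.invalidate_caches()
    return importlib.import_module("demo_font").Font
```

Because `demo_font` no longer imports `demo_metrics` at module level, it initialises completely before the metrics module asks for `Font`, and the cycle never materialises. Whether a function-level import is acceptable style for a given project is a separate question.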

From a more integrated perspective on what a font is in PDF terms, the font metrics for the 14 core fonts are incomplete. As a font, they should be accompanied by an encoding and a character map, both of which are specified in the font resource dictionary. I would argue that, what in this PR is a Font class, in actuality is more a FontMetrics class.

I can think of two ways forward:

  1. Turn the current Font class into a FontMetrics class, add that to _codecs/core_fontmetrics.py, and if, in the future, pypdf would want to include a full-fledged Font class, then the FontMetrics class would be a subclass in case the user hits a non-embedded instance of a core font.
  2. Keep the current (beginnings of a) Font class, turn the current FONT_METRICS back into a Dict[Dict, Any] (or whatever it was, but not a class), and add an init method that takes a font resource dict as argument and uses the FONT_METRICS for the core fonts if those metrics are not available in the font resource dict itself.

I personally would prefer the second option; I see your point that the second option is less object oriented, but I do think that it is more robust for future changes. Nevertheless, please advise!
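A minimal sketch of what the second option could look like, under stated assumptions: `CORE_FONT_WIDTHS`, `from_font_resource` and the font name `Example-Core` are hypothetical names chosen for illustration, the width values are placeholders and not real AFM data, and the resource is modelled as a plain `dict` instead of a pypdf dictionary object. Because the fallback table is plain data, the module holding it does not need to import the Font class.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical stand-in for the generated core-font table. It is plain data,
# so it has no reason to import Font. The widths are placeholders.
CORE_FONT_WIDTHS: Dict[str, Dict[str, int]] = {
    "Example-Core": {"A": 600, " ": 300},
}


@dataclass
class Font:
    name: str
    widths: Dict[str, int] = field(default_factory=dict)

    @classmethod
    def from_font_resource(cls, resource: Dict[str, Any]) -> "Font":
        """Build a Font from a font resource dict, falling back to core metrics."""
        name = str(resource.get("/BaseFont", "")).lstrip("/")
        if "/Widths" in resource and "/FirstChar" in resource:
            # Simplification: treats character codes as Unicode code points
            # and ignores the font's /Encoding entirely.
            first = int(resource["/FirstChar"])
            widths = {
                chr(first + i): int(w)
                for i, w in enumerate(resource["/Widths"])
            }
        else:
            # No metrics in the resource: use the core-font table if it knows the name.
            widths = dict(CORE_FONT_WIDTHS.get(name, {}))
        return cls(name=name, widths=widths)
```

A resource that carries its own `/Widths` wins; a bare `{"/BaseFont": "/Example-Core"}` picks up the fallback table.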

@PJBrs
Contributor Author

PJBrs commented Jul 18, 2025

On second (final?) thought, it occurs to me that probably the first option is best.

A Font dataclass already exists in pypdf/_text_extraction/_layout_mode/_font.py, and it seems that the associated code is already used in _page. Perhaps it is best to just focus on the missing functionality...

@PJBrs
Contributor Author

PJBrs commented Jul 18, 2025

I've given all this still more thought. So, the AFM file contains a set of widths and a set of font characteristics that we could call FontDescriptor. Existing pypdf code includes incomplete core font width information in two places:

_default_fonts_space_width: Dict[str, int] = {
and:

Furthermore, code already exists in two places to get a set of font widths from a font resource dictionary, in pypdf/_cmap.py:

def build_font_width_map(

and in pypdf/_text_extraction/_layout_mode/_font.py:
def __post_init__(self) -> None:

In both cases, an incomplete fallback exists for the 14 core fonts that this PR should be able to solve.

Code dealing with FontDescriptor information is nowhere to be seen yet (as far as I can tell). Such information is being used in pdftk to generate appearance streams. For embedded fonts, this ought to be part of a font resource dictionary. Here's an example of an arbitrary PDF that I opened:
{'/Type': '/Font', '/Subtype': '/CIDFontType2', '/CIDSystemInfo': {'/Ordering': 'Identity', '/Registry': 'Adobe', '/Supplement': 0}, '/FontDescriptor': IndirectObject(29, 0, 139871477943648), '/BaseFont': '/RIPFSJ+Tahoma,Bold', '/W': [3, [292], 15, [312], 19, [636], 39, [757], 49, [770], 54, [633], 55, [612], 68, [598], 69, [631], 72, [593], 76, [301], 78, [602], 79, [301], 80, [953], 82, [617], 85, [433], 87, [415], 88, [640], 92, [575], 188, [636]]}
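For readers unfamiliar with the /W entry in the dictionary above: in a CIDFont it is a flat array whose items take one of two forms, `c [w1 w2 ...]` (consecutive widths starting at CID `c`) or `c_first c_last w` (one width for a range). The sketch below expands such an array into a CID-to-width map. The function name `expand_cid_widths` is hypothetical, the input is modelled as plain Python lists, and this is not a copy of what `build_font_width_map` does.

```python
from typing import Dict, List, Union

Number = Union[int, float]


def expand_cid_widths(w: List[Union[Number, List[Number]]]) -> Dict[int, Number]:
    """Expand a CIDFont /W array into a {cid: width} mapping."""
    widths: Dict[int, Number] = {}
    i = 0
    while i < len(w):
        first = int(w[i])
        second = w[i + 1]
        if isinstance(second, list):
            # Form "c [w1 w2 ...]": consecutive CIDs starting at c.
            for offset, width in enumerate(second):
                widths[first + offset] = width
            i += 2
        else:
            # Form "c_first c_last w": the same width for the whole range.
            last = int(second)
            width = w[i + 2]
            for cid in range(first, last + 1):
                widths[cid] = width
            i += 3
    return widths


# First three entries of the /W array shown above.
sample_w = [3, [292], 15, [312], 19, [636]]
cid_widths = expand_cid_widths(sample_w)
```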

And this is the /FontDescriptor field:
{'/Type': '/FontDescriptor', '/Ascent': 1000, '/CapHeight': 727, '/Descent': -207, '/Flags': 32, '/FontBBox': [-698, -419, 2196, 1065], '/ItalicAngle': 0, '/StemV': 0, '/XHeight': 548, '/FontName': '/RIPFSJ+Tahoma,Bold', '/FontFile2': IndirectObject(30, 0, 139871477943648)}

Adding code that parses FontDescriptor information would be trivial. For instance, info like the above is parsed in at least three places already (as far as I can tell). It's just that the FontDescriptor information is not extracted.
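As a rough illustration of how small that parsing step could be, here is a sketch that turns a dictionary like the one above into a dataclass. The field names and defaults are assumptions, not an agreed design; the function name follows the `build_font_descriptor_from_dict` idea floated later in this comment. The input is a plain `dict`; in pypdf the /FontDescriptor entry would first have to be resolved from its IndirectObject.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class FontDescriptor:
    # Field names are assumptions made for this sketch.
    name: str
    ascent: float
    descent: float
    cap_height: float
    x_height: float
    italic_angle: float
    flags: int
    bbox: Tuple[float, float, float, float]


def build_font_descriptor_from_dict(d: Dict[str, Any]) -> FontDescriptor:
    """Extract the descriptor fields, using 0 for any missing entry."""
    bbox = d.get("/FontBBox", [0, 0, 0, 0])
    return FontDescriptor(
        name=str(d.get("/FontName", "")).lstrip("/"),
        ascent=float(d.get("/Ascent", 0)),
        descent=float(d.get("/Descent", 0)),
        cap_height=float(d.get("/CapHeight", 0)),
        x_height=float(d.get("/XHeight", 0)),
        italic_angle=float(d.get("/ItalicAngle", 0)),
        flags=int(d.get("/Flags", 0)),
        bbox=(float(bbox[0]), float(bbox[1]), float(bbox[2]), float(bbox[3])),
    )


# The /FontDescriptor shown above, without the /FontFile2 indirect reference.
tahoma = {
    "/Type": "/FontDescriptor", "/Ascent": 1000, "/CapHeight": 727,
    "/Descent": -207, "/Flags": 32, "/FontBBox": [-698, -419, 2196, 1065],
    "/ItalicAngle": 0, "/StemV": 0, "/XHeight": 548,
    "/FontName": "/RIPFSJ+Tahoma,Bold",
}
descriptor = build_font_descriptor_from_dict(tahoma)
```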

This leads me to the following conclusion:

  • The Font dataclass in this PR is too broad for the information that it adds
  • A FontMetrics dataclass would be better, but the widths part would not be so easy to access and no right place exists yet for the FontDescriptor information

I now think that the most elegant solution would be to:

  • Add a FontDescriptor dataclass
  • Adapt the afm_to_dataclass.py script so that it creates FONT_METRICS: Dict[str, Tuple[FontDescriptor, Dict[str, int]]]; in other words, a Dict with font name as key, and a tuple as value. Each tuple consists of a FontDescriptor and a Dict with font widths. This is very similar to the original pdfminer.six code, but with a dataclass for the FontDescriptor instead.
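The shape of that table can be sketched as follows. This is not the actual afm_to_dataclass.py script: `parse_afm`, the field names and the abbreviated AFM excerpt are assumptions for illustration, the excerpt's numbers should be checked against the real AFM file, and keying the widths by glyph name (as AFM files do) is a choice made here, not something settled in this PR.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class FontDescriptor:
    # Minimal set of fields for this sketch; names are assumptions.
    name: str
    ascent: float = 0.0
    descent: float = 0.0
    cap_height: float = 0.0
    x_height: float = 0.0
    italic_angle: float = 0.0


# AFM header keyword -> dataclass field.
_AFM_KEYS = {
    "Ascender": "ascent",
    "Descender": "descent",
    "CapHeight": "cap_height",
    "XHeight": "x_height",
    "ItalicAngle": "italic_angle",
}


def parse_afm(text: str) -> Tuple[FontDescriptor, Dict[str, int]]:
    """Parse AFM text into (FontDescriptor, {glyph name: width})."""
    name = ""
    props: Dict[str, float] = {}
    widths: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "FontName":
            name = parts[1]
        elif parts[0] in _AFM_KEYS:
            props[_AFM_KEYS[parts[0]]] = float(parts[1])
        elif parts[0] == "C":
            # Character metric line, e.g. "C 65 ; WX 667 ; N A ; B ... ;"
            fields = {}
            for chunk in line.split(";"):
                tokens = chunk.split()
                if len(tokens) >= 2:
                    fields[tokens[0]] = tokens[1]
            if "N" in fields and "WX" in fields:
                widths[fields["N"]] = int(float(fields["WX"]))
    return FontDescriptor(name=name, **props), widths


# Abbreviated excerpt in AFM syntax, for illustration only.
SAMPLE_AFM = """\
FontName Helvetica
ItalicAngle 0
CapHeight 718
XHeight 523
Ascender 718
Descender -207
StartCharMetrics 2
C 32 ; WX 278 ; N space ; B 0 0 0 0 ;
C 65 ; WX 667 ; N A ; B 14 0 654 718 ;
EndCharMetrics
"""

desc, glyph_widths = parse_afm(SAMPLE_AFM)
FONT_METRICS: Dict[str, Tuple[FontDescriptor, Dict[str, int]]] = {
    desc.name: (desc, glyph_widths),
}
```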

This enables the following subsequent steps (not in this PR):

  1. Correct the core font width information in _cmap.py and in _text_extraction/_layout_mode/_font.py; these are really trivial changes that then make it very easy to adapt the generate_appearance_stream method so that it can correctly wrap, scale, centre and right-align text. <-- This is the big aim that I'm after, and have most of the code for.
  2. Create a build_font_descriptor_from_dict method (or maybe add a build_font_descriptor_from_dict classmethod to initialise a FontDescriptor, who knows!)
  3. Move the _font.py file to pypdf/_font.py
  4. Perhaps add FontDescriptor attribute to Font class
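To make the first step concrete, here is a sketch of why complete widths matter for alignment. Glyph widths in Type 1 fonts are expressed in thousandths of a text-space unit, so the rendered width is the sum of the glyph widths times the font size divided by 1000. The function names, the `align` values and the placeholder widths are all hypothetical; this is not the code from the pending appearance-stream patches.

```python
from typing import Dict


def text_width(text: str, widths: Dict[str, int], font_size: float,
               default_width: int = 0) -> float:
    """Width of `text` in text-space units; glyph widths are in 1/1000 units."""
    total = sum(widths.get(ch, default_width) for ch in text)
    return total * font_size / 1000.0


def aligned_x(text: str, widths: Dict[str, int], font_size: float,
              box_left: float, box_right: float, align: str = "left") -> float:
    """X position at which to start drawing `text` inside the box."""
    free = (box_right - box_left) - text_width(text, widths, font_size)
    if align == "right":
        return box_left + free
    if align == "centre":
        return box_left + free / 2
    return box_left


# Placeholder widths, not taken from any real font.
demo_widths = {"a": 500, "b": 600}
```

With these placeholder widths, "ab" at size 10 is 11 units wide, so in a box from 0 to 100 it starts at 89 when right-aligned and at 44.5 when centred. A missing glyph width silently shrinks the computed line, which is where the incomplete fallback tables hurt.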

@stefan6419846, please advise. It's like a menu :-) Would you like:

  1. The PR as it currently is;
  2. Turn the current Font class into a FontMetrics class and add that to _codecs/core_fontmetrics.py, as discussed above;
  3. Keep the current (beginnings of a) Font class, turn the current FONT_METRICS back into a Dict[Dict, Any] (or whatever it was, but not a class), and add an init method that takes a font resource dict as argument (I now think this is a bad idea);
  4. The above proposal (this I like best).

I'm going to redo the PR in about three weeks or so.
