mbollmann
Member

@mbollmann mbollmann commented Jun 1, 2025

Implements minimal-diff XML saving, and adds integration tests to ensure that all XML files can be loaded and saved again without loss of information, as a step towards #4766. Concretely, this PR:

  • Ensures that loading an XML file & immediately saving it again does not change/lose information.
    • Edge cases: <mrf> tags and XML comments were initially not preserved; this has since been addressed.
  • Ensures that loading an XML file & immediately saving it again does not cause unnecessary changes that just create noise in the diffs.
    • Exception: Superfluous empty tags that currently exist in some XML files (addressed).
    • Exception: <colocated> blocks that do not exhaustively list all colocated volumes (addressed).
  • Fixes a few minor serialization bugs that were discovered during this testing.
  • Adds support for <mrf> tags, refactors Event.colocated_ids to make clear which IDs were defined in the XML and which were inferred automatically, and adds PaperType to indicate frontmatter & backmatter (the latter was not previously supported).

Works on the full Anthology data now, without exception, after fixing a few minor things in the XML. (An earlier work-in-progress version only worked on the "Toy Anthology" in the tests folder, and an intermediate version still needed plenty of exception rules and xfails.)

TODOs:

  • Update docs and CHANGELOG
  • Check codecov report
  • Rebase this to reduce the incredibly noisy commit history

@mbollmann mbollmann self-assigned this Jun 1, 2025
@mbollmann mbollmann added the python-library Concerning the acl-anthology-py library label Jun 1, 2025

codecov bot commented Jun 1, 2025

Codecov Report

Attention: Patch coverage is 98.29060% with 2 lines in your changes missing coverage. Please review.

Project coverage is 93.75%. Comparing base (9302755) to head (1bffbe5).
Report is 14 commits behind head on python-dev.

Files with missing lines                       Patch %   Lines
python/acl_anthology/collections/event.py      92.85%    1 Missing ⚠️
python/acl_anthology/utils/xml.py              98.24%    1 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##           python-dev    #5337      +/-   ##
==============================================
+ Coverage       93.67%   93.75%   +0.08%     
==============================================
  Files              35       35              
  Lines            2781     2850      +69     
==============================================
+ Hits             2605     2672      +67     
- Misses            176      178       +2     
Files with missing lines                         Coverage Δ
python/acl_anthology/collections/__init__.py     100.00% <100.00%> (ø)
python/acl_anthology/collections/collection.py   97.61% <100.00%> (+0.89%) ⬆️
python/acl_anthology/collections/eventindex.py   92.95% <100.00%> (-0.20%) ⬇️
python/acl_anthology/collections/paper.py        92.41% <100.00%> (+0.10%) ⬆️
python/acl_anthology/collections/types.py        100.00% <100.00%> (ø)
python/acl_anthology/people/name.py              97.81% <100.00%> (-0.73%) ⬇️
python/acl_anthology/sigs.py                     97.02% <100.00%> (ø)
python/acl_anthology/utils/git.py                76.27% <100.00%> (-6.78%) ⬇️
python/acl_anthology/venues.py                   88.40% <100.00%> (ø)
python/acl_anthology/collections/event.py        99.20% <92.85%> (-0.80%) ⬇️
... and 1 more

... and 1 file with indirect coverage changes

@mbollmann mbollmann force-pushed the python-minimal-diff-xml-saving branch from 4ecb094 to 2a907da Compare June 3, 2025 19:00
@mbollmann mbollmann marked this pull request as ready for review June 3, 2025 19:07
@mbollmann
Member Author

@mjpost I would appreciate a review on this if you have time. It implements a feature I've been meaning to work on for a long time now — guaranteeing that saving Python objects back to XML is non-destructive and causes minimal diffs — to make it safe & attractive to use this library for any kind of modifications to our data. I’ve rebased & force-pushed this entire branch to clean up the commit history, so it may be easiest to read commit by commit. (8dc6af2 and maybe bc93463 are the big ones.)

@mjpost
Member

mjpost commented Jun 3, 2025

Sure thing, I'll do this asap.

Member

@mjpost mjpost left a comment

This is quite a large change and I'm a little confused at the high-level purpose. It is likely because I am not yet very familiar with the library and likely won't be till I've ported an ingestion script to it or something of that kind. When I work with XML, it is usually to ingest new data (where I may need to preserve existing volume structure) or to make small modifications to author tags or something of that order. To maintain XML consistency, I've always just used indent(), which I spent some time finessing some time back.

I realize there is no clear question here but maybe there is something for you to respond to.

"""Saves this collection as an XML file.

Arguments:
    path: The filename to save to. If None, defaults to `self.path`.
    minimal_diff: If True (default), will compare against an existing XML file in `self.path` to minimize the difference, i.e., to prevent noise from changes in the XML that make no semantic difference. See [`utils.xml.ensure_minimal_diff`][acl_anthology.utils.xml.ensure_minimal_diff] for details.
Member

Based on the high-level PR description, it seems odd to me that this would be a flag. In the old code that I'm familiar with, the XML elements are aware of their "tails", which contains the formatted whitespace between tags, allowing perfect reconstruction. Why do you have this argument here?

Member Author

This has nothing at all to do with tails or whitespace or indentation, so I'm a little confused how to answer. Hopefully my explanation below makes it clear. Otherwise, there are plenty of test cases in this PR that document the expected behaviour when ensure_minimal_diff() is called.

Member

This makes sense, as with your answer elsewhere. What I'm confused about is why this is a parameter. Under what setting would we not wish to preserve formatting?

Member Author

If you set minimal_diff=False, you get a "canonical" order of attributes of tags as they are defined in the to_xml() functions throughout the classes. A dev purpose of this was to compare the results with and without the new algorithm enabled. It just seemed natural to me to have it, but it may not be important any longer now that all tests pass.

@mbollmann
Member Author

mbollmann commented Jun 6, 2025

> This is quite a large change and I'm a little confused at the high-level purpose. It is likely because I am not yet very familiar with the library and likely won't be till I've ported an ingestion script to it or something of that kind. When I work with XML, it is usually to ingest new data (where I may need to preserve existing volume structure) or to make small modifications to author tags or something of that order. To maintain XML consistency, I've always just used indent(), which I spent some time finessing some time back.
>
> I realize there is no clear question here but maybe there is something for you to respond to.

For ingesting new collections, this PR is (mostly) irrelevant. This is intended for modifications to existing files.

When we want to modify the XML with the library, the entire XML file will get rewritten. It's not really feasible to track which data has changed inside the Python code and write back only that, for a variety of reasons. But there are many potential sources of noise that could be introduced here: most prominently, XML attributes and tags getting written back in a different order, because in many instances the order makes no semantic difference. Of course, the order of <author> tags in a <paper> block is important, but it doesn't matter in what order <title>, <publisher>, <pages>, etc. appear, or whether a missing publisher is indicated by <publisher/> or by not having the tag at all. Our XML is not uniform in this respect in the slightest. There are more "semantically meaningless" differences, which are documented in the test cases. Notably, this has nothing at all to do with indentation, which continues to be handled by indent().

This means that in many, many, many cases (EDIT: see next comment), if we wanted to use the library to make small changes (which I really want to be able to do), like

anthology = Anthology()
paper = anthology.get("20xx.acl-long.42")
paper.authors[0] = NameSpec("Foo Bar, Baz")
anthology.save()

...this could potentially introduce hundreds of lines of diffs, even if only one line has actually meaningfully changed. I feel this would be a major barrier for actually using the library for such purposes.
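To make the notion of a "semantically meaningless" difference concrete, here is a minimal sketch using only the standard library's xml.etree.ElementTree. It is not the PR's ensure_minimal_diff() and not how the library compares files; the function names and the ORDERED_TAGS set are assumptions made up for this illustration, and mixed content (markup inside titles or abstracts, where child order and tail text do matter) is deliberately ignored.

```python
import xml.etree.ElementTree as ET

# Assumption for this sketch: tags whose relative order carries meaning.
ORDERED_TAGS = {"author", "editor"}


def canonical(elem):
    """Return a comparable form of `elem` that ignores attribute order,
    the order of children not in ORDERED_TAGS, and completely empty tags.

    Limitation: tail text is ignored, so this is not suitable for
    elements with mixed content.
    """
    ordered, unordered = [], []
    for child in elem:
        form = canonical(child)
        if form is None:
            continue  # e.g. <publisher/> is treated like a missing tag
        (ordered if child.tag in ORDERED_TAGS else unordered).append(form)
    text = (elem.text or "").strip()
    if not elem.attrib and not text and not ordered and not unordered:
        return None
    attrs = tuple(sorted(elem.attrib.items()))
    return (elem.tag, attrs, text, tuple(ordered), tuple(sorted(unordered)))


def semantically_equal(xml_a, xml_b):
    """Compare two XML strings under the relaxed rules above."""
    return canonical(ET.fromstring(xml_a)) == canonical(ET.fromstring(xml_b))


paper_a = (
    '<paper id="1"><title>T</title><author>A</author>'
    "<author>B</author><publisher/></paper>"
)
paper_b = '<paper id="1"><author>A</author><author>B</author><title>T</title></paper>'
paper_c = '<paper id="1"><author>B</author><author>A</author><title>T</title></paper>'

print(semantically_equal(paper_a, paper_b))  # tag order and <publisher/> differ
print(semantically_equal(paper_a, paper_c))  # author order differs
```

Under these rules, paper_a and paper_b compare as equal, while paper_c does not, because the order of its <author> tags differs.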

@mbollmann
Member Author

I just tested running the following:

>>> anthology = Anthology(datadir=PosixPath('../data'), verbose=True)
>>> for collection in anthology.collections.values():
...     collection.load()
...     collection.save(minimal_diff=False)

This writes back the XML files with no changes at all to the data – this is verified by the integration tests in this PR. However:

$ git diff --shortstat
 1273 files changed, 57916 insertions(+), 57929 deletions(-)

The diff contains thousands of lines like this:

-  <volume id="1" ingest-date="2021-10-27" type="proceedings">
+  <volume id="1" type="proceedings" ingest-date="2021-10-27">
-      <attachment type="presentation" hash="1f79a932">1961.earlymt-1.2.Presentation.pdf</attachment>
+      <attachment hash="1f79a932" type="presentation">1961.earlymt-1.2.Presentation.pdf</attachment>
-      <url hash="2af34e42">1971.earlymt-1.6</url>
       <pages>77-94</pages>
+      <url hash="2af34e42">1971.earlymt-1.6</url>
+      <isbn>3-540-59040-4</isbn>
       <month>April 26–28</month>
       <year>1993</year>
-      <isbn>3-540-59040-4</isbn>

The high-level purpose of this PR is to be smart about this and prevent that.
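As a rough illustration of the idea for the attribute-order case only, the sketch below reorders the attributes of a freshly serialized element to follow the order found in the existing file. It uses the standard library's xml.etree.ElementTree rather than the library's own code, reorder_attributes_like() is a hypothetical helper named for this example, and the actual ensure_minimal_diff() covers more cases than this.

```python
import xml.etree.ElementTree as ET


def reorder_attributes_like(new, reference):
    """Reorder `new`'s attributes in place to follow `reference`'s order.

    Values always come from `new`; attributes that exist only in `new`
    are appended afterwards in their original order.
    """
    remaining = dict(new.attrib)
    new.attrib.clear()
    for key in reference.attrib:
        if key in remaining:
            new.attrib[key] = remaining.pop(key)
    new.attrib.update(remaining)


# Element as it exists in the file on disk.
reference = ET.fromstring('<volume id="1" ingest-date="2021-10-27" type="proceedings"/>')
# Element as a "canonical" serialization might produce it.
fresh = ET.fromstring('<volume id="1" type="proceedings" ingest-date="2021-10-27"/>')

reorder_attributes_like(fresh, reference)
# Since Python 3.8, ElementTree serializes attributes in insertion order.
print(ET.tostring(fresh, encoding="unicode"))
```

After the call, fresh lists its attributes as id, ingest-date, type, so writing it back would no longer produce a diff line for this element.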

@mjpost
Member

mjpost commented Jun 8, 2025

Thanks, the example is helpful. The need for preserving formatting is clear, as is the difficulty of doing it when you're operating through the library, rather than (as has been my experience so far) directly operating on the XML.

Member

@mjpost mjpost left a comment

I'm happy to merge this. I haven't had a close, user-level look, but also likely won't have time in the near future. Unfortunately I can't give detailed feedback until I get my hands dirty, but I also don't see any reason to hold this up.

@mjpost
Member

mjpost commented Jun 11, 2025

Just to be clear, I'm good with this, will leave merging to you.

@mbollmann mbollmann merged commit cf43af2 into python-dev Jun 20, 2025
14 checks passed
@mbollmann mbollmann deleted the python-minimal-diff-xml-saving branch June 20, 2025 15:33