Implement minimal-diff XML saving #5337

mbollmann · 2025-06-01T19:04:12Z

Implements minimal-diff XML saving, and adds integration tests to ensure that all XML files can be loaded and saved again without loss of information, as a step towards #4766. Concretely, this PR:

- Ensures that loading an XML file & immediately saving it again does not change/lose information.
  - ~~Edge cases: <mrf> tags and XML comments are currently not preserved.~~ addressed
- Ensures that loading an XML file & immediately saving it again does not cause unnecessary changes that just create noise in the diffs.
  - ~~Exception: Superfluous empty tags that currently exist in some XML files.~~ addressed
  - ~~Exception: <colocated> blocks that do not exhaustively list all colocated volumes.~~ addressed
- Fixes a few minor serialization bugs that were discovered during this testing.
- Adds support for <mrf> tags, refactors Event.colocated_ids to make clear which IDs were defined in the XML and which were inferred automatically, and adds PaperType to indicate frontmatter & backmatter (the latter was not previously supported).

~~Work in progress, currently has an initial implementation that works on the "Toy Anthology" in the tests folder, but not on the full one yet.~~ Works on the full Anthology data now, ~~after fixing a few minor things in the XML and defining plenty of exception rules and xfails.~~ without exception.

TODOs:

- Update docs and CHANGELOG
- Check codecov report
- Rebase this to reduce the incredibly noisy commit history

codecov · 2025-06-01T19:07:17Z

Codecov Report

Attention: Patch coverage is 98.29060% with 2 lines in your changes missing coverage. Please review.

Project coverage is 93.75%. Comparing base (9302755) to head (1bffbe5).
Report is 14 commits behind head on python-dev.

Files with missing lines	Patch %	Lines
python/acl_anthology/collections/event.py	92.85%	1 Missing ⚠️
python/acl_anthology/utils/xml.py	98.24%	1 Missing ⚠️

Additional details and impacted files

@@              Coverage Diff               @@
##           python-dev    #5337      +/-   ##
==============================================
+ Coverage       93.67%   93.75%   +0.08%     
==============================================
  Files              35       35              
  Lines            2781     2850      +69     
==============================================
+ Hits             2605     2672      +67     
- Misses            176      178       +2

Files with missing lines	Coverage Δ
python/acl_anthology/collections/__init__.py	`100.00% <100.00%> (ø)`
python/acl_anthology/collections/collection.py	`97.61% <100.00%> (+0.89%)`	⬆️
python/acl_anthology/collections/eventindex.py	`92.95% <100.00%> (-0.20%)`	⬇️
python/acl_anthology/collections/paper.py	`92.41% <100.00%> (+0.10%)`	⬆️
python/acl_anthology/collections/types.py	`100.00% <100.00%> (ø)`
python/acl_anthology/people/name.py	`97.81% <100.00%> (-0.73%)`	⬇️
python/acl_anthology/sigs.py	`97.02% <100.00%> (ø)`
python/acl_anthology/utils/git.py	`76.27% <100.00%> (-6.78%)`	⬇️
python/acl_anthology/venues.py	`88.40% <100.00%> (ø)`
python/acl_anthology/collections/event.py	`99.20% <92.85%> (-0.80%)`	⬇️
... and 1 more

... and 1 file with indirect coverage changes

mbollmann · 2025-06-03T19:14:26Z

@mjpost I would appreciate a review on this if you have time. It implements a feature I've been meaning to work on for a long time now — guaranteeing that saving Python objects back to XML is non-destructive and causes minimal diffs — to make it safe & attractive to use this library for any kind of modifications to our data. I’ve rebased & force-pushed this entire branch to clean up the commit history, so it may be easiest to read commit by commit. (8dc6af2 and maybe bc93463 are the big ones.)

mjpost · 2025-06-03T19:16:19Z

Sure thing, I'll do this asap.

mjpost

This is quite a large change and I'm a little confused at the high-level purpose. It is likely because I am not yet very familiar with the library and likely won't be till I've ported an ingestion script to it or something of that kind. When I work with XML, it is usually to ingest new data (where I may need to preserve existing volume structure) or to make small modifications to author tags or something of that order. To maintain XML consistency, I've always just used indent(), which I spent some time finessing some time back.

I realize there is no clear question here but maybe there is something for you to respond to.

mjpost · 2025-06-06T18:57:30Z

python/acl_anthology/collections/collection.py

        """Saves this collection as an XML file.

        Arguments:
            path: The filename to save to. If None, defaults to `self.path`.
+            minimal_diff: If True (default), will compare against an existing XML file in `self.path` to minimize the difference, i.e., to prevent noise from changes in the XML that make no semantic difference.  See [`utils.xml.ensure_minimal_diff`][acl_anthology.utils.xml.ensure_minimal_diff] for details.


Based on the high-level PR description, it seems odd to me that this would be a flag. In the old code that I'm familiar with, the XML elements are aware of their "tails", which contains the formatted whitespace between tags, allowing perfect reconstruction. Why do you have this argument here?

mbollmann · 2025-06-06

This has nothing at all to do with tails or whitespace or indentation, so I'm a little confused how to answer. Hopefully my explanation below makes it clear. Otherwise, there are plenty of test cases in this PR that document the expected behaviour when ensure_minimal_diff() is called.

mjpost · 2025-06-08

This makes sense, as with your answer elsewhere. What I'm confused about is why this is a parameter. Under what setting would we not wish to preserve formatting?

mbollmann · 2025-06-08

If you set minimal_diff=False, you get a "canonical" order of attributes of tags as they are defined in the to_xml() functions throughout the classes. A dev purpose of this was to compare the results with and without the new algorithm enabled. It just seemed natural to me to have it, but it may not be important any longer now that all tests pass.

mbollmann · 2025-06-06T21:17:29Z

> This is quite a large change and I'm a little confused at the high-level purpose. It is likely because I am not yet very familiar with the library and likely won't be till I've ported an ingestion script to it or something of that kind. When I work with XML, it is usually to ingest new data (where I may need to preserve existing volume structure) or to make small modifications to author tags or something of that order. To maintain XML consistency, I've always just used indent(), which I spent some time finessing some time back.
>
> I realize there is no clear question here but maybe there is something for you to respond to.

For ingesting new collections, this PR is (mostly) irrelevant. This is intended for modifications to existing files.

When we want to modify the XML with the library, the entire XML file will get rewritten. It's not really feasible to track which data has changed inside the Python code and only write back that, for a variety of reasons. But there are many potential sources of noise that could be introduced here: most prominently, XML attributes and tags getting written back in a different order, because in many instances this makes no semantic difference. Of course, the order of <author> tags in a <paper> block is important, but it doesn't matter in what order <title>, <publisher>, <pages>, etc. appear, or if a missing publisher is indicated by <publisher/> or not having the tag at all. This is not uniform throughout our XML in the slightest. There are more "semantically meaningless" differences, which are documented in the test cases. Notably, this has nothing at all to do with indentation, which continues to be handled by indent().

This means that in many, many, many cases (EDIT: see next comment), if we wanted to use the library to make small changes (which I really want to be able to do), like

anthology = Anthology()
paper = anthology.get("20xx.acl-long.42")
paper.authors[0] = NameSpec("Foo Bar, Baz")
anthology.save()

...this could potentially introduce hundreds of lines of diffs, even if only one line has actually meaningfully changed. I feel this would be a major barrier for actually using the library for such purposes.
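
To illustrate what "semantically meaningless" means here, the following is a rough sketch of a logical comparison between two XML snippets. It is not the comparison implemented in this PR: it uses only the standard library, the names `ORDERED_TAGS`, `is_empty`, `canonical`, and `logically_equal` are made up for this example, and it assumes elements without mixed content (text interleaved with child tags).

```python
import xml.etree.ElementTree as ET

# Hypothetical set of tags whose relative order carries meaning;
# the list actually used by the library may differ.
ORDERED_TAGS = {"author", "editor"}


def is_empty(elem):
    """True for a tag with no attributes, no children, and no text."""
    return len(elem) == 0 and not elem.attrib and not (elem.text or "").strip()


def canonical(elem):
    """Build a comparable form that ignores attribute order, the order of
    unordered children, and superfluous empty tags such as <publisher/>."""
    children = [canonical(c) for c in elem if not is_empty(c)]
    ordered = tuple(c for c in children if c[0] in ORDERED_TAGS)
    unordered = tuple(sorted(c for c in children if c[0] not in ORDERED_TAGS))
    attrs = tuple(sorted(elem.attrib.items()))
    return (elem.tag, attrs, (elem.text or "").strip(), ordered, unordered)


def logically_equal(xml_a, xml_b):
    """Compare two XML strings by their canonical forms."""
    return canonical(ET.fromstring(xml_a)) == canonical(ET.fromstring(xml_b))


old = ('<paper id="1"><title>T</title><author>A</author><author>B</author>'
       '<publisher/><url hash="x" type="y">u</url></paper>')
new = ('<paper id="1"><url type="y" hash="x">u</url><author>A</author>'
       '<title>T</title><author>B</author></paper>')
print(logically_equal(old, new))
```

Swapping the two `<author>` tags would make the comparison fail, while moving `<url>` or dropping the empty `<publisher/>` does not.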

mbollmann · 2025-06-06T22:09:41Z

I just tested running the following:

>>> anthology = Anthology(datadir=PosixPath('../data'), verbose=True)
>>> for collection in anthology.collections.values():
...     collection.load()
...     collection.save(minimal_diff=False)

This writes back the XML files with no changes at all to the data – this is verified by the integration tests in this PR. However:

$ git diff --shortstat
 1273 files changed, 57916 insertions(+), 57929 deletions(-)

The diff contains thousands of lines like this:

-  <volume id="1" ingest-date="2021-10-27" type="proceedings">
+  <volume id="1" type="proceedings" ingest-date="2021-10-27">

-      <attachment type="presentation" hash="1f79a932">1961.earlymt-1.2.Presentation.pdf</attachment>
+      <attachment hash="1f79a932" type="presentation">1961.earlymt-1.2.Presentation.pdf</attachment>

-      <url hash="2af34e42">1971.earlymt-1.6</url>
       <pages>77-94</pages>
+      <url hash="2af34e42">1971.earlymt-1.6</url>

+      <isbn>3-540-59040-4</isbn>
       <month>April 26–28</month>
       <year>1993</year>
-      <isbn>3-540-59040-4</isbn>

The high-level purpose of this PR is to be smart about this and prevent that.
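
As a minimal sketch of the general idea (not the algorithm in `utils.xml.ensure_minimal_diff`, whose details are in the PR itself), one way to avoid this kind of noise is to reorder a freshly serialized element so that its attributes and children follow the order found in the existing file. The function name `align_to_reference` is hypothetical, and the sketch uses the standard library's `xml.etree.ElementTree` and ignores whitespace handling.

```python
import xml.etree.ElementTree as ET


def align_to_reference(new, ref):
    """Reorder attributes and children of `new` in place to follow `ref`."""
    # Attributes: keys known from the reference first, in reference order,
    # then any attributes that only exist in the new element.
    attrs = dict(new.attrib)
    new.attrib.clear()
    for key in ref.attrib:
        if key in attrs:
            new.set(key, attrs.pop(key))
    for key, value in attrs.items():
        new.set(key, value)

    # Children: stable sort by the first position of each tag in the
    # reference; tags unknown to the reference go last. Because the sort is
    # stable, repeated tags such as <author> keep their relative order.
    position = {}
    for index, child in enumerate(ref):
        position.setdefault(child.tag, index)
    new[:] = sorted(new, key=lambda c: position.get(c.tag, len(ref)))

    # Recurse, pairing the k-th <tag> child of `new` with the k-th of `ref`.
    ref_by_tag = {}
    for child in ref:
        ref_by_tag.setdefault(child.tag, []).append(child)
    for child in new:
        candidates = ref_by_tag.get(child.tag)
        if candidates:
            align_to_reference(child, candidates.pop(0))


existing = ET.fromstring(
    '<paper id="1"><title>T</title><pages>1-2</pages>'
    '<url hash="a" type="b">u</url></paper>'
)
fresh = ET.fromstring(
    '<paper id="1"><url type="b" hash="a">u</url>'
    '<title>T2</title><pages>1-2</pages></paper>'
)
align_to_reference(fresh, existing)
print(ET.tostring(fresh, encoding="unicode"))
```

In this toy case only the changed title would show up in a diff against the existing element; the attribute and tag order are taken from the reference.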

mjpost · 2025-06-08T17:27:52Z

Thanks, the example is helpful. The need for preserving formatting is clear, as is the difficulty of doing it when you're operating through the library, rather than (as has been my experience so far) directly operating on the XML.

mjpost

I'm happy to merge this. I haven't had a close, user-level look, but also likely won't have time in the near future. Unfortunately I can't give detailed feedback until I get my hands dirty, but I also don't see any reason to hold this up.

mjpost · 2025-06-11T14:09:48Z

Just to be clear, I'm good with this, will leave merging to you.

mbollmann self-assigned this Jun 1, 2025

mbollmann added the python-library Concerning the acl-anthology-py library label Jun 1, 2025

mbollmann added 11 commits June 3, 2025 19:53

Make pytest use difflib for list comparison asserts

ff9de3b

Make integration tests use repo data directly, add explicit encodings

7101b09

Add PaperType enum (adds support for backmatter)

aa761dc

Add support for <mrf> and change missing attachment type handling

e4c3825

Make logical XML element comparison more robust (fix bugs, add tests)

5e3330c

Fix serialization bug for <retracted> and <removed>

3065042

Add EventLinkingType, change format of Event.colocated_ids

bc93463

Make XML indentation more robust to a few edge cases

f352575

Add minimal-diff algorithm for XML serialization

8dc6af2

Run integration tests on all XML files & add docs

14a75f7

Update justfile to add FLAGS, change aliases

2a907da

mbollmann force-pushed the python-minimal-diff-xml-saving branch from 4ecb094 to 2a907da Compare June 3, 2025 19:00

mbollmann marked this pull request as ready for review June 3, 2025 19:07

mbollmann mentioned this pull request Jun 5, 2025

Missing functionality for modifying and ingesting data with Python lib #4766

Open

4 tasks

Add venues & sigs YAML roundtrip to integration tests

7246579

mjpost reviewed Jun 6, 2025

Add XML declaration to three files, remove special case in testing

1bffbe5

mbollmann mentioned this pull request Jun 8, 2025

Adapt MIT Press ingestion script to use our library #5359

Draft

mjpost approved these changes Jun 8, 2025

mbollmann merged commit cf43af2 into python-dev Jun 20, 2025
14 checks passed

mbollmann deleted the python-minimal-diff-xml-saving branch June 20, 2025 15:33

mbollmann mentioned this pull request Jun 21, 2025

Create acl-anthology release v0.5.3 #5405

Merged
