-
Notifications
You must be signed in to change notification settings - Fork 354
Implement minimal-diff XML saving #5337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## python-dev #5337 +/- ##
==============================================
+ Coverage 93.67% 93.75% +0.08%
==============================================
Files 35 35
Lines 2781 2850 +69
==============================================
+ Hits 2605 2672 +67
- Misses 176 178 +2
🚀 New features to boost your workflow:
|
4ecb094
to
2a907da
Compare
@mjpost I would appreciate a review on this if you have time. It implements a feature I've been meaning to work on for a long time now — guaranteeing that saving Python objects back to XML is non-destructive and causes minimal diffs — to make it safe & attractive to use this library for any kind of modifications to our data. I’ve rebased & force-pushed this entire branch to clean up the commit history, so it may be easiest to read commit by commit. (8dc6af2 and maybe bc93463 are the big ones.) |
Sure thing, I'll do this asap. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite a large change and I'm a little confused at the high-level purpose. It is likely because I am not yet very familiar with the library and likely won't be till I've ported an ingestion script to it or something of that kind. When I work with XML, it is usually to ingest new data (where I may need to preserve existing volume structure) or to make small modifications to author tags or something of that order. To maintain XML consistency, I've always just used indent()
, which I spent some time finessing some time back.
I realize there is no clear question here but maybe there is something for you to respond to.
"""Saves this collection as an XML file. | ||
|
||
Arguments: | ||
path: The filename to save to. If None, defaults to `self.path`. | ||
minimal_diff: If True (default), will compare against an existing XML file in `self.path` to minimize the difference, i.e., to prevent noise from changes in the XML that make no semantic difference. See [`utils.xml.ensure_minimal_diff`][acl_anthology.utils.xml.ensure_minimal_diff] for details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the high-level PR description, it seems odd to me that this would be a flag. In the old code that I'm familiar with, the XML elements are aware of their "tails", which contains the formatted whitespace between tags, allowing perfect reconstruction. Why do you have this argument here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has nothing at all to do with tails or whitespace or indentation, so I'm a little confused how to answer. Hopefully my explanation below makes it clear. Otherwise, there are plenty of test cases in this PR that document the expected behaviour when ensure_minimal_diff()
is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense, as with your answer elsewhere. What I'm confused about is why this is a parameter. Under what setting would we not wish to preserve formatting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you set minimal_diff=False
, you get a "canonical" order of attributes of tags as they are defined in the to_xml()
functions throughout the classes. A dev purpose of this was to compare the results with and without the new algorithm enabled. It just seemed natural to me to have it, but it may not be important any longer now that all tests pass.
For ingesting new collections, this PR is (mostly) irrelevant. This is intended for modifications to existing files. When we want to modify the XML with the library, the entire XML file will get rewritten. It's not really feasible to track which data has changed inside the Python code and only write back that, for a variety of reasons. But there are many potential sources of noise that could be introduced here: most prominently, XML attributes and tags getting written back in a different order, because in many instances this makes no semantic difference. Of course, the order of This means that in many, many, many cases (EDIT: see next comment), if we wanted to use the library to make small changes (which I really want to be able to do), like anthology = Anthology()
paper = anthology.get("20xx.acl-long.42")
paper.authors[0] = NameSpec("Foo Bar, Baz")
anthology.save() ...this could potentially introduce hundreds of lines of diffs, even if only one line has actually meaningfully changed. I feel this would be a major barrier for actually using the library for such purposes. |
I just tested running the following: >>> anthology = Anthology(datadir=PosixPath('../data'), verbose=True)
>>> for collection in anthology.collections.values():
... collection.load()
... collection.save(minimal_diff=False) This writes back the XML files with no changes at all to the data – this is verified by the integration tests in this PR. However: $ git diff --shortstat
1273 files changed, 57916 insertions(+), 57929 deletions(-) The diff contains thousands of lines like this: - <volume id="1" ingest-date="2021-10-27" type="proceedings">
+ <volume id="1" type="proceedings" ingest-date="2021-10-27"> - <attachment type="presentation" hash="1f79a932">1961.earlymt-1.2.Presentation.pdf</attachment>
+ <attachment hash="1f79a932" type="presentation">1961.earlymt-1.2.Presentation.pdf</attachment> - <url hash="2af34e42">1971.earlymt-1.6</url>
<pages>77-94</pages>
+ <url hash="2af34e42">1971.earlymt-1.6</url> + <isbn>3-540-59040-4</isbn>
<month>April 26–28</month>
<year>1993</year>
- <isbn>3-540-59040-4</isbn> The high-level purpose of this PR is to be smart about this and prevent that. |
Thanks, the example is helpful. The need for preserving formatting is clear, as is the difficulty of doing it when you're operating through the library, rather than (as has been my experience so far) directly operating on the XML. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to merge this. I haven't had a close, user-level look, but also likely won't have time in the near future. Unfortunately I can't give detailed feedback until I get my hands dirty, but I also don't seen any reason to hold this up.
Just to be clear, I'm good with this, will leave merging to you. |
Implements minimal-diff XML saving, and adds integration tests to ensure that all XML files can be loaded and saved again without loss of information, as a step towards #4766. Concretely, this PR:
Edge cases:addressed<mrf>
tags and XML comments are currently not preserved.Exception: Superfluous empty tags that currently exist in some XML files.addressedException:addressed<colocated>
blocks that do not exhaustively list all colocated volumes.<mrf>
tags, refactorsEvent.colocated_ids
to make clear which IDs where defined in the XML and which were inferred automatically, adds PaperType to indicate frontmatter & backmatter (the latter was not previously supported).Work in progress, currently has an initial implementation that works on the "Toy Anthology" in the tests folder, but not on the full one yet.Works on the full Anthology data now,after fixing a few minor things in the XML and defining plenty of exception rules and xfails.without exception.TODOs: