Skip to content

Bug in xml.dom.xmlbuilder.DOMBuilder.parse() #128302

@tungol

Description

@tungol

Bug report

Bug description:

I believe there is a bug in xml.dom.xmlbuilder.parse().

xml.dom.xmlbuilder is not documented by python, we don't have any tests for
them, and I can't find evidence on github of these classes being in use. That's
okay, because in this case the classes are implementations of a w3c standard
which was at the time a draft, and which is heavily documented.

First, some digital archeology. The first commit of Lib/xml/dom/xmlbuilder.py
is from 2003-01-25, with the message "Import from PyXML 1.10."
The issue already exists in that commit.

The earlier remains of the PyXML project are on sourceforce
in a cvs repo. Looking at the log there, I find that the first commit for xmlbuilder.py
is from 2002-02-21 with this message:

Preliminary implementation of the new direct-from-Expat minidom builder.
xmlbuilder contains the basic interface code for the DOM Level 3 "LS-Load"
feature; it can certainly be generalized a little more, but this seems a
reasonable start.  expatbuilder contains the code that xmlbuilder uses to
glue minidom and Expat together.  There's nothing general there!  ;-)

The bug is already present in this version as well.

The file sees a few more commits, four of which reference updating to track
the new version of the draft standard, the latest of which is from 2002-08-02.
None of the commits since being imported into CPython reference the standard.
This means that in looking at the code, our best reference is the 2002-07-25 draft
version: https://www.w3.org/TR/2002/WD-DOM-Level-3-LS-20020725/.

What's called DOMBuilder in that draft and in our implementation, would in later
drafts be renamed to DOMParser and then LSParser, but the basic class remains
in the standard.

Going to the standard now:

The primary interface in the standard is DOMImplementationLS, a mixin for the
DOMImplementation interface. The relevant methods for us are:

DOMBuilder         createDOMBuilder(in unsigned short mode, in DOMString schemaType)raises(DOMException);
DOMInputSource     createDOMInputSource();

Described as:

createDOMBuilder
Create a new DOMBuilder. The newly constructed parser may then be configured by means of its setFeature method, and used to parse documents by means of its parse method.

createDOMInputSource
Create a new "empty" DOMInputSource.

The documentation for DOMInputSource says:

This interface represents a single input source for an XML entity.

This interface allows an application to encapsulate information about an input source in a single object, which may include a public identifier, a system identifier, a byte stream (possibly with a specified encoding), and/or a character stream.

The DOMBuilder will use the DOMInputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly; if not, the parser will use a byte stream, if available; if neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.

A DOMInputSource object belongs to the application: the parser shall never modify it in any way (it may modify a copy if necessary).

Finally, documentation for the DOMBuilder interface:

A interface to an object that is able to build a DOM tree from various input sources.

DOMBuilder provides an API for parsing XML documents and building the corresponding DOM document tree. A DOMBuilder instance is obtained from the DOMImplementationLS interface by invoking its createDOMBuildermethod.

DOMBuilders have a number of named features that can be queried or set. The name of DOMBuilder features must be valid XML names. Implementation specific features (extensions) should choose a implementation specific prefix to avoid name collisions.

Even if all features must be recognized by all implementations, being able to set a state (true or false) is not always required. The following list of recognized features indicates the definitions of each feature state, if setting the state to true or false must be supported or is optional and, which state is the default one:

And its parse method:

parse
Parse an XML document from a resource identified by a DOMInputSource.
Parameters:
is of type DOMInputSource
The DOMInputSource from which the source document is to be read.
Return Value:
Document

So an application using this interface is intended to:

  1. Get a hold of a DOMImplementation object. This is documented in python,
    and accomplished via xml.dom.getDOMImplementation().
  2. Next, call createDOMBuilder() and createDOMInputSource() to get
    those two respective objects.
  3. Configure the options for the DOMBuilder using the methods canSetFeature,
    getFeature, and setFeature.
  4. Configure the values of the DOMInputSource as needed. As noted above,
    DOMInput source will first prefer its characterStream attribute. If that
    attribute is null, it will prefer the byteStream attribute. If both of
    these are null, then the URI from its systemId attribute is used.
  5. Finally, pass the DOMInputSource to DOMBuilder.parse() to receive a
    Document object.

In Python, this looks like this:

from xml.dom import getDOMImplementation

imp = getDOMImplementation()
builder = imp.createDOMBuilder(imp.MODE_SYNCHRONOUS, None)
# configure the builder, if needed
source = imp.createDOMInputSource()
# set one or more attributes on the "empty" source object
source.systemId = "http://www.w3.org/2000/svg"
document = builder.parse(source)
# Do whatever with the document

For a use case like this, with no additional customization of the source object,
the spec also has a DOMBuilder.parseURI() method which accepts a URI directly and
constructs the necessary DOMInputSource internally.

With all that as background, here's our current implementation of DOMBuilder.parse():

def parse(self, input):
options = copy.copy(self._options)
options.filter = self.filter
options.errorHandler = self.errorHandler
fp = input.byteStream
if fp is None and options.systemId:
import urllib.request
fp = urllib.request.urlopen(input.systemId)
return self._parse_bytestream(fp, options)

The problem is on line 192. systemId is not an attribute that exists on options here.
It's an instance of the xml.dom.xmlbuilder.Options class and represents the
features that can be configured on DOMBuilder. Those options are documented in
the specification, and are set using DOMBuilder.setFeature(). Our implementation
has a dictionary DOMBuilder._settings which contains the options from the
spec that we support, which starts here:

_settings = {
("namespace_declarations", 0): [
("namespace_declarations", 0)],

The important thing is that systemId is not an configuration setting in either
the specificiation or in our implementation, so as written we can never reach
the if branch in that method.

I believe that the relevant lines should be:

        fp = input.byteStream
        if fp is None and input.systemId:
            import urllib.request
            fp = urllib.request.urlopen(input.systemId)

(The implementation is also non-conformant because it fails to consider
input.characterStream before using input.byteStream, but that's a different
probem. I'm not looking to add a missing feature right now, just fix the already-existing-but-broken one.)

CPython versions tested on:

3.13

Operating systems tested on:

No response

Linked PRs

Metadata

Metadata

Assignees

No one assigned

    Labels

    stdlibPython modules in the Lib dirtopic-XMLtype-bugAn unexpected behavior, bug, or error

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions