-
-
Notifications
You must be signed in to change notification settings - Fork 32.4k
Description
Bug report
Bug description:
I believe there is a bug in xml.dom.xmlbuilder.parse()
.
xml.dom.xmlbuilder
is not documented by python, we don't have any tests for
them, and I can't find evidence on github of these classes being in use. That's
okay, because in this case the classes are implementations of a w3c standard
which was at the time a draft, and which is heavily documented.
First, some digital archeology. The first commit of Lib/xml/dom/xmlbuilder.py
is from 2003-01-25, with the message "Import from PyXML 1.10."
The issue already exists in that commit.
The earlier remains of the PyXML project are on sourceforce
in a cvs repo. Looking at the log there, I find that the first commit for xmlbuilder.py
is from 2002-02-21 with this message:
Preliminary implementation of the new direct-from-Expat minidom builder.
xmlbuilder contains the basic interface code for the DOM Level 3 "LS-Load"
feature; it can certainly be generalized a little more, but this seems a
reasonable start. expatbuilder contains the code that xmlbuilder uses to
glue minidom and Expat together. There's nothing general there! ;-)
The bug is already present in this version as well.
The file sees a few more commits, four of which reference updating to track
the new version of the draft standard, the latest of which is from 2002-08-02.
None of the commits since being imported into CPython reference the standard.
This means that in looking at the code, our best reference is the 2002-07-25 draft
version: https://www.w3.org/TR/2002/WD-DOM-Level-3-LS-20020725/.
What's called DOMBuilder in that draft and in our implementation, would in later
drafts be renamed to DOMParser and then LSParser, but the basic class remains
in the standard.
Going to the standard now:
The primary interface in the standard is DOMImplementationLS, a mixin for the
DOMImplementation
interface. The relevant methods for us are:
DOMBuilder createDOMBuilder(in unsigned short mode, in DOMString schemaType)raises(DOMException);
DOMInputSource createDOMInputSource();
Described as:
createDOMBuilder
Create a new DOMBuilder. The newly constructed parser may then be configured by means of its setFeature method, and used to parse documents by means of its parse method.createDOMInputSource
Create a new "empty" DOMInputSource.
The documentation for DOMInputSource says:
This interface represents a single input source for an XML entity.
This interface allows an application to encapsulate information about an input source in a single object, which may include a public identifier, a system identifier, a byte stream (possibly with a specified encoding), and/or a character stream.
The DOMBuilder will use the DOMInputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly; if not, the parser will use a byte stream, if available; if neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
A DOMInputSource object belongs to the application: the parser shall never modify it in any way (it may modify a copy if necessary).
Finally, documentation for the DOMBuilder interface:
A interface to an object that is able to build a DOM tree from various input sources.
DOMBuilder provides an API for parsing XML documents and building the corresponding DOM document tree. A DOMBuilder instance is obtained from the DOMImplementationLS interface by invoking its createDOMBuildermethod.
DOMBuilders have a number of named features that can be queried or set. The name of DOMBuilder features must be valid XML names. Implementation specific features (extensions) should choose a implementation specific prefix to avoid name collisions.
Even if all features must be recognized by all implementations, being able to set a state (true or false) is not always required. The following list of recognized features indicates the definitions of each feature state, if setting the state to true or false must be supported or is optional and, which state is the default one:
And its parse method:
parse
Parse an XML document from a resource identified by a DOMInputSource.
Parameters:
is of type DOMInputSource
The DOMInputSource from which the source document is to be read.
Return Value:
Document
So an application using this interface is intended to:
- Get a hold of a DOMImplementation object. This is documented in python,
and accomplished viaxml.dom.getDOMImplementation()
. - Next, call
createDOMBuilder()
andcreateDOMInputSource()
to get
those two respective objects. - Configure the options for the DOMBuilder using the methods
canSetFeature
,
getFeature
, andsetFeature
. - Configure the values of the DOMInputSource as needed. As noted above,
DOMInput source will first prefer itscharacterStream
attribute. If that
attribute is null, it will prefer thebyteStream
attribute. If both of
these are null, then the URI from itssystemId
attribute is used. - Finally, pass the
DOMInputSource
toDOMBuilder.parse()
to receive a
Document
object.
In Python, this looks like this:
from xml.dom import getDOMImplementation
imp = getDOMImplementation()
builder = imp.createDOMBuilder(imp.MODE_SYNCHRONOUS, None)
# configure the builder, if needed
source = imp.createDOMInputSource()
# set one or more attributes on the "empty" source object
source.systemId = "http://www.w3.org/2000/svg"
document = builder.parse(source)
# Do whatever with the document
For a use case like this, with no additional customization of the source object,
the spec also has a DOMBuilder.parseURI()
method which accepts a URI directly and
constructs the necessary DOMInputSource internally.
With all that as background, here's our current implementation of DOMBuilder.parse()
:
cpython/Lib/xml/dom/xmlbuilder.py
Lines 187 to 195 in aeb9b65
def parse(self, input): | |
options = copy.copy(self._options) | |
options.filter = self.filter | |
options.errorHandler = self.errorHandler | |
fp = input.byteStream | |
if fp is None and options.systemId: | |
import urllib.request | |
fp = urllib.request.urlopen(input.systemId) | |
return self._parse_bytestream(fp, options) |
The problem is on line 192. systemId
is not an attribute that exists on options
here.
It's an instance of the xml.dom.xmlbuilder.Options
class and represents the
features that can be configured on DOMBuilder
. Those options are documented in
the specification, and are set using DOMBuilder.setFeature()
. Our implementation
has a dictionary DOMBuilder._settings
which contains the options from the
spec that we support, which starts here:
cpython/Lib/xml/dom/xmlbuilder.py
Lines 101 to 103 in aeb9b65
_settings = { | |
("namespace_declarations", 0): [ | |
("namespace_declarations", 0)], |
The important thing is that systemId
is not an configuration setting in either
the specificiation or in our implementation, so as written we can never reach
the if branch in that method.
I believe that the relevant lines should be:
fp = input.byteStream
if fp is None and input.systemId:
import urllib.request
fp = urllib.request.urlopen(input.systemId)
(The implementation is also non-conformant because it fails to consider
input.characterStream
before using input.byteStream
, but that's a different
probem. I'm not looking to add a missing feature right now, just fix the already-existing-but-broken one.)
CPython versions tested on:
3.13
Operating systems tested on:
No response