Bug in `xml.dom.xmlbuilder.DOMBuilder.parse()`

# Bug report

### Bug description:

I believe there is a bug in `xml.dom.xmlbuilder.parse()`.

`xml.dom.xmlbuilder` is not documented by python, we don't have any tests for
them, and I can't find evidence on github of these classes being in use. That's
okay, because in this case the classes are implementations of a w3c standard
which was at the time a draft, and which is heavily documented.

First, some digital archeology. The first commit of `Lib/xml/dom/xmlbuilder.py`
is from 2003-01-25, with the message ["Import from PyXML 1.10."](https://github.com/python/cpython/commit/9ea67889dc8fecbd2e57e2c3a5a65f1f69b7c1dc)
The issue already exists in that commit.

The earlier remains of the [PyXML project are on sourceforce](https://sourceforge.net/projects/pyxml/)
in a cvs repo. Looking at the log there, I find that the first commit for xmlbuilder.py
is from 2002-02-21 with this message:

```
Preliminary implementation of the new direct-from-Expat minidom builder.
xmlbuilder contains the basic interface code for the DOM Level 3 "LS-Load"
feature; it can certainly be generalized a little more, but this seems a
reasonable start.  expatbuilder contains the code that xmlbuilder uses to
glue minidom and Expat together.  There's nothing general there!  ;-)
```

The bug is already present in this version as well.

The file sees a few more commits, four of which reference updating to track
the new version of the draft standard, the latest of which is from 2002-08-02.
None of the commits since being imported into CPython reference the standard.
This means that in looking at the code, our best reference is the 2002-07-25 draft
version: https://www.w3.org/TR/2002/WD-DOM-Level-3-LS-20020725/.

What's called DOMBuilder in that draft and in our implementation, would in later
drafts be renamed to DOMParser and then LSParser, but the basic class remains
in the standard.

Going to the standard now:

The primary interface in the standard is [DOMImplementationLS](https://www.w3.org/TR/2002/WD-DOM-Level-3-LS-20020725/DOM3-LS.html#load-save-LS-DOMImplementation), a mixin for the
`DOMImplementation` interface. The relevant methods for us are:

```
DOMBuilder         createDOMBuilder(in unsigned short mode, in DOMString schemaType)raises(DOMException);
DOMInputSource     createDOMInputSource();
```

Described as:

> createDOMBuilder
>     Create a new DOMBuilder. The newly constructed parser may then be configured by means of its setFeature method, and used to parse documents by means of its parse method. 
> 
> createDOMInputSource
>     Create a new "empty" DOMInputSource. 

The documentation for [DOMInputSource](https://www.w3.org/TR/2002/WD-DOM-Level-3-LS-20020725/DOM3-LS.html#load-save-LS-DOMInputSource) says:

> This interface represents a single input source for an XML entity.
> 
> This interface allows an application to encapsulate information about an input source in a single object, which may include a public identifier, a system identifier, a byte stream (possibly with a specified encoding), and/or a character stream.
> 
> The DOMBuilder will use the DOMInputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly; if not, the parser will use a byte stream, if available; if neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
> 
> A DOMInputSource object belongs to the application: the parser shall never modify it in any way (it may modify a copy if necessary).

Finally, documentation for the [DOMBuilder](https://www.w3.org/TR/2002/WD-DOM-Level-3-LS-20020725/DOM3-LS.html#load-save-LS-DOMBuilder) interface:

> A interface to an object that is able to build a DOM tree from various input sources.
> 
> DOMBuilder provides an API for parsing XML documents and building the corresponding DOM document tree. A DOMBuilder instance is obtained from the DOMImplementationLS interface by invoking its createDOMBuildermethod.
> 
> DOMBuilders have a number of named features that can be queried or set. The name of DOMBuilder features must be valid XML names. Implementation specific features (extensions) should choose a implementation specific prefix to avoid name collisions.
> 
> Even if all features must be recognized by all implementations, being able to set a state (true or false) is not always required. The following list of recognized features indicates the definitions of each feature state, if setting the state to true or false must be supported or is optional and, which state is the default one:


And its [parse method](https://www.w3.org/TR/2002/WD-DOM-Level-3-LS-20020725/DOM3-LS.html#load-save-LS-DOMBuilder-parse):

> parse
>     Parse an XML document from a resource identified by a DOMInputSource.
>     Parameters:
>     is of type DOMInputSource
>         The DOMInputSource from which the source document is to be read.
>     Return Value:
>     Document

So an application using this interface is intended to:

1. Get a hold of a DOMImplementation object. This is documented in python,
   and accomplished via `xml.dom.getDOMImplementation()`.
2. Next, call `createDOMBuilder()` and `createDOMInputSource()` to get
   those two respective objects.
3. Configure the options for the DOMBuilder using the methods `canSetFeature`,
   `getFeature`, and `setFeature`.
4. Configure the values of the DOMInputSource as needed. As noted above,
   DOMInput source will first prefer its `characterStream` attribute. If that
   attribute is null, it will prefer the `byteStream` attribute. If both of
   these are null, then the URI from its `systemId` attribute is used.
5. Finally, pass the `DOMInputSource` to `DOMBuilder.parse()` to receive a
   `Document` object.

In Python, this looks like this:

```python
from xml.dom import getDOMImplementation

imp = getDOMImplementation()
builder = imp.createDOMBuilder(imp.MODE_SYNCHRONOUS, None)
# configure the builder, if needed
source = imp.createDOMInputSource()
# set one or more attributes on the "empty" source object
source.systemId = "http://www.w3.org/2000/svg"
document = builder.parse(source)
# Do whatever with the document
```

For a use case like this, with no additional customization of the source object,
the spec also has a `DOMBuilder.parseURI()` method which accepts a URI directly and
constructs the necessary DOMInputSource internally.

With all that as background, here's our current implementation of `DOMBuilder.parse()`:

https://github.com/python/cpython/blob/aeb9b65aa26444529e4adc7d6e5b0d3dd9889ec2/Lib/xml/dom/xmlbuilder.py#L187-L195

The problem is on line 192. `systemId` is not an attribute that exists on `options` here.
It's an instance of the `xml.dom.xmlbuilder.Options` class and represents the
features that can be configured on `DOMBuilder`. Those options are documented in
the specification, and are set using `DOMBuilder.setFeature()`. Our implementation
has a dictionary `DOMBuilder._settings` which contains the options from the
spec that we support, which starts here:

https://github.com/python/cpython/blob/aeb9b65aa26444529e4adc7d6e5b0d3dd9889ec2/Lib/xml/dom/xmlbuilder.py#L101-L103

The important thing is that `systemId` is not an configuration setting in either
the specificiation or in our implementation, so as written we can never reach
the if branch in that method.

I believe that the relevant lines should be:

```python
        fp = input.byteStream
        if fp is None and input.systemId:
            import urllib.request
            fp = urllib.request.urlopen(input.systemId)
```

(The implementation is also non-conformant because it fails to consider
`input.characterStream` before using `input.byteStream`, but that's a different
probem. I'm not looking to add a missing feature right now, just fix the already-existing-but-broken one.)

### CPython versions tested on:

3.13

### Operating systems tested on:

_No response_


### Linked PRs
* gh-128284
* gh-128582
* gh-128583

	def parse(self, input):
	options = copy.copy(self._options)
	options.filter = self.filter
	options.errorHandler = self.errorHandler
	fp = input.byteStream
	if fp is None and options.systemId:
	import urllib.request
	fp = urllib.request.urlopen(input.systemId)
	return self._parse_bytestream(fp, options)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Bug in `xml.dom.xmlbuilder.DOMBuilder.parse()` #128302

Bug report

Bug description:

CPython versions tested on:

Operating systems tested on:

Linked PRs

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

	_settings = {
	("namespace_declarations", 0): [
	("namespace_declarations", 0)],

Uh oh!

Bug in xml.dom.xmlbuilder.DOMBuilder.parse() #128302

Description

Bug report

Bug description:

CPython versions tested on:

Operating systems tested on:

Linked PRs

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Bug in `xml.dom.xmlbuilder.DOMBuilder.parse()` #128302