[DISCUSSION] Reasons to keep URIRef, BlankNode, and Literal as subclasses of str? #2866

@ashleysommer

There is a particular design choice in RDFLib that has existed since the original version in 2001.

URIRef, BlankNode, and Literal are subclasses of Identifier.
Identifier is a combination class, a subclass of both str and the abstract class Node.

It has been this way for a long time. In the Py2->3 transition days, Identifier subclassed six.text_type, and before that it subclassed Python 2's unicode type, right back to the first tracked commit. (Note: Python 2's unicode type was renamed to str in Python 3; it is effectively the same type. Python 2's str became bytes in Python 3.)

It's common to see very old Python libraries using a similar pattern, and there was good reason for it. Subclasses of str were treated as "real strings" by the system, which enabled better memory management and provided built-in text features like startswith() and lower(), plus comparisons like __eq__, __gt__, __lt__, __ge__, and __le__, all of which the subclass gets for free and which work exactly like those of a native string.
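
As a small illustration of the "for free" behaviour described above, here is a sketch using a hypothetical Tag class (an assumption for this example, not part of RDFLib) that subclasses str and defines nothing of its own:

```python
class Tag(str):
    """Hypothetical str subclass that defines no methods of its own."""


t = Tag("Hello")

# Text methods are inherited from str and work unchanged.
print(t.startswith("He"))        # True
print(t.lower())                 # hello

# Equality, ordering, and hashing also behave like a plain str.
print(t == "Hello")              # True
print(t < "World")               # True
print(hash(t) == hash("Hello"))  # True

# Inherited methods return a plain str, not the subclass.
print(type(t.lower()) is str)    # True
```

One side effect worth noticing is the last line: anything derived from the subclass through an inherited method silently drops back to plain str.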

Additionally, in Python prior to v2.1, "string interning" worked on subclasses of strings. That means short constant strings, and strings managed using sys.intern(), were kept in a global string table, deduplicated in memory, and given shortcuts like pre-calculated hashes. From Python 2.1 onwards that was considered a bug, and it was fixed so that subclasses of str could no longer be interned.

One thing to note, however: in the Python 2 days Identifier wasn't a subclass of str but of unicode, which I don't think had an intern table, so RDFLib never benefited from that.

That pattern in RDFLib has been maintained for 21 years without question, because "it's always been that way", but I suggest it is hurting RDFLib's usability and performance. The main cause is the enormous amount of needless string copying that happens when creating URIRef, BlankNode, and Literal instances, and when reading them back out.

Examples:

>>> from rdflib import URIRef
>>> from rdflib.term import Identifier
>>> mystring = "https://example.org/example"
>>> id(mystring)
126135362697728
>>> myuri = URIRef(mystring)
>>> id(myuri)
126135340674256
>>> innerstring1 = str(myuri)
>>> id(innerstring1)
126135358248864
>>> innerstring1 is mystring
False
>>> innerstring2 = str.__str__(myuri)
>>> id(innerstring2)
126135353645280
>>> innerstring2 is mystring
False
>>> innerstring3 = super(Identifier, myuri).__str__()
>>> id(innerstring3)
126135350065744
>>> innerstring3 is mystring
False

Firstly, when you create a URIRef (or any subclass of Identifier) and pass in a str, the __new__ constructor hands that directly to the internal string constructor (roughly str.__new__(cls, value)). However, Python knows it is really building an instance of a subclass of str, so it doesn't use the copy-on-write (CoW) deduplication optimisation. Instead it takes a COPY of the string passed in, and saves it.

But there is then no way of getting back the string it copied and saved. If you want to serialize that URIRef, or treat it like a string you can read from, you need to call str(myuri) on it. That doesn't give you back the original string, or even the first copy it made; it makes a new COPY of the internal string. Even if you try to treat the variable as a real string, calling str.__str__(myuri) does the same thing and makes another new copy. Finally, if you try to tap into the inner string using super().__str__(), you get the same result. There is no way to read the inner contents of a subclass of str without copying it.
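
The same copying can be reproduced without RDFLib installed. The sketch below uses a hypothetical FakeIdentifier class as a stand-in for a str subclass like URIRef; it is an assumption for illustration only, and the identity results are CPython behaviour rather than a language guarantee:

```python
class FakeIdentifier(str):
    """Hypothetical stand-in for a str subclass such as URIRef."""

    def __new__(cls, value: str) -> "FakeIdentifier":
        # Same shape as a typical str-subclass constructor.
        return str.__new__(cls, value)


source = "https://example.org/example"
node = FakeIdentifier(source)

# Read the contents back out twice, keeping both results alive.
first = str(node)
second = str(node)

print(type(first) is str)   # True: a plain str comes back
print(first == source)      # True: same characters
print(first is source)      # False: not the original object
print(first is second)      # False: a fresh copy on every call
```

For comparison, calling str() on an exact str returns the very same object, so the copy only appears once a subclass is involved.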

Python 3 can still intern strings, and doing so is a major memory management and performance speedup. Short static constant strings in your code (20 characters or less, though the exact rule is a CPython implementation detail) are saved by the parser as interned strings, and will always refer to the same object.

>>> "myconstant"  # short constants are interned automatically
'myconstant'
>>> myvar1 = "myconstant"
>>> myvar2 = str("myconstant")
>>> myvar1 is myvar2
True

and for a long string:

>>> import sys
>>> myconstant = sys.intern("https://example.org/example")
>>> myvar1 = str(myconstant)
>>> myvar2 = str(myconstant)
>>> myconstant is myvar1 and myvar1 is myvar2
True

You can see that each of the variables, and each new string created from the interned string, is the exact same string object in memory.

RDFLib cannot take advantage of this however, because string interning DOES NOT work on subclasses of str.

>>> myconstant = sys.intern("https://example.org/example")
>>> myvar1 = URIRef(myconstant)
>>> myvar2 = URIRef(myconstant)
>>> myconstant is myvar1
False
>>> myvar1 is myvar2
False
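
Calling sys.intern() on the subclass instance directly does not help either, because CPython rejects anything that is not an exact str. A minimal sketch, again with a hypothetical FakeIdentifier standing in for URIRef:

```python
import sys


class FakeIdentifier(str):
    """Hypothetical stand-in for a str subclass such as URIRef."""


# An exact str is accepted and returned as the interned object.
plain = sys.intern("https://example.org/example")

# A str subclass instance is refused outright.
try:
    sys.intern(FakeIdentifier("https://example.org/example"))
except TypeError as exc:
    rejected = True
    print(f"TypeError: {exc}")
else:
    rejected = False

print(rejected)
```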

The reason has to do with immutability. Python treats all real strings as immutable: any operation you perform on a string leaves the original untouched and gives you back a new string. That guarantee of immutability is provided by the stringlib type interface in stdlib, and it is why strings can be interned: identical interned strings can return the same object because they are immutable. However, Python can't make the same guarantee about subclasses of str. A subclass can change its internal contents while identifying as the same instance, which is why Python must COPY the source string. Passing a real str into the URIRef() constructor takes a copy of the original string; it can't hold an internal reference to the original because the copy is mutable and the original must stay immutable. Similarly, when reading it back out as a str, you can't simply read out the internal state as a string, because Python needs an immutable snapshot of the subclass, and it gets one by taking a COPY.

I was halfway through making a PR to improve the performance and memory usage of URIRef creation and serialization by taking advantage of sys.intern() for frequently used URIRefs, when I started putting all the pieces of this issue together.

It looks like RDFLib will need a fundamental shift in architecture of its most used foundation classes if we want to resolve this, so I think it will be a good issue to tackle for v8.0.
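
To make the direction concrete, here is one possible shape such a change could take: composition instead of inheritance. This is purely an illustrative sketch and not a proposed RDFLib API; the WrappedIRI class and its _value attribute are hypothetical names invented for this example.

```python
import sys


class WrappedIRI:
    """Hypothetical identifier that holds a str instead of being one."""

    __slots__ = ("_value",)

    def __init__(self, value: str) -> None:
        # The stored object is an exact str, so sys.intern() accepts it
        # and equal values end up sharing a single string object.
        self._value = sys.intern(value)

    def __str__(self) -> str:
        # Hands back the stored str itself; no copy is made.
        return self._value

    def __eq__(self, other: object) -> bool:
        if isinstance(other, WrappedIRI):
            return self._value == other._value
        return NotImplemented

    def __hash__(self) -> int:
        return hash(self._value)


# Build two equal strings at runtime so they start as separate objects.
s1 = "".join(["https://example.org/", "example"])
s2 = "".join(["https://example.org/", "example"])

a = WrappedIRI(s1)
b = WrappedIRI(s2)

print(a == b)            # True
print(str(a) is str(b))  # True: both read out the same interned str
```

The trade-off is that such a class is no longer a str, so code relying on isinstance(node, str) or on inherited methods like startswith() would need to change, which is why this would be a breaking change suited to a major version.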
