Commit 1262e41

erlend-aasland, AlexWaygood, CAM-Gerlach, CorvinM, and ezio-melotti authored
gh-108590: Improve sqlite3 docs on encoding issues and how to handle those (#108699)
Add a guide for how to handle non-UTF-8 text encodings.
Link to that guide from the 'text_factory' docs.

Co-authored-by: Alex Waygood <[email protected]>
Co-authored-by: C.A.M. Gerlach <[email protected]>
Co-authored-by: Corvin <[email protected]>
Co-authored-by: Ezio Melotti <[email protected]>
Co-authored-by: Serhiy Storchaka <[email protected]>
1 parent 81ed80d commit 1262e41

1 file changed

+50
-33
lines changed

Doc/library/sqlite3.rst

Lines changed: 50 additions & 33 deletions
@@ -1154,6 +1154,10 @@ Connection objects
                  f.write('%s\n' % line)
          con.close()

+      .. seealso::
+
+         :ref:`sqlite3-howto-encoding`
+

    .. method:: backup(target, *, pages=-1, progress=None, name="main", sleep=0.250)

@@ -1220,6 +1224,10 @@ Connection objects

       .. versionadded:: 3.7

+      .. seealso::
+
+         :ref:`sqlite3-howto-encoding`
+
    .. method:: getlimit(category, /)

       Get a connection runtime limit.
@@ -1441,39 +1449,8 @@ Connection objects
       and returns a text representation of it.
       The callable is invoked for SQLite values with the ``TEXT`` data type.
       By default, this attribute is set to :class:`str`.
-      If you want to return ``bytes`` instead, set *text_factory* to ``bytes``.

-      Example:
-
-      .. testcode::
-
-         con = sqlite3.connect(":memory:")
-         cur = con.cursor()
-
-         AUSTRIA = "Österreich"
-
-         # by default, rows are returned as str
-         cur.execute("SELECT ?", (AUSTRIA,))
-         row = cur.fetchone()
-         assert row[0] == AUSTRIA
-
-         # but we can make sqlite3 always return bytestrings ...
-         con.text_factory = bytes
-         cur.execute("SELECT ?", (AUSTRIA,))
-         row = cur.fetchone()
-         assert type(row[0]) is bytes
-         # the bytestrings will be encoded in UTF-8, unless you stored garbage in the
-         # database ...
-         assert row[0] == AUSTRIA.encode("utf-8")
-
-         # we can also implement a custom text_factory ...
-         # here we implement one that appends "foo" to all strings
-         con.text_factory = lambda x: x.decode("utf-8") + "foo"
-         cur.execute("SELECT ?", ("bar",))
-         row = cur.fetchone()
-         assert row[0] == "barfoo"
-
-         con.close()
+      See :ref:`sqlite3-howto-encoding` for more details.

    .. attribute:: total_changes

@@ -1632,7 +1609,6 @@ Cursor objects
              COMMIT;
          """)

-
    .. method:: fetchone()

       If :attr:`~Cursor.row_factory` is ``None``,
@@ -2611,6 +2587,47 @@ With some adjustments, the above recipe can be adapted to use a
 instead of a :class:`~collections.namedtuple`.


+.. _sqlite3-howto-encoding:
+
+How to handle non-UTF-8 text encodings
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+By default, :mod:`!sqlite3` uses :class:`str` to adapt SQLite values
+with the ``TEXT`` data type.
+This works well for UTF-8 encoded text, but it might fail for other encodings
+and invalid UTF-8.
+You can use a custom :attr:`~Connection.text_factory` to handle such cases.
+
+Because of SQLite's `flexible typing`_, it is not uncommon to encounter table
+columns with the ``TEXT`` data type containing non-UTF-8 encodings,
+or even arbitrary data.
+To demonstrate, let's assume we have a database with ISO-8859-2 (Latin-2)
+encoded text, for example a table of Czech-English dictionary entries.
+Assuming we now have a :class:`Connection` instance :py:data:`!con`
+connected to this database,
+we can decode the Latin-2 encoded text using this :attr:`~Connection.text_factory`:
+
+.. testcode::
+
+   con.text_factory = lambda data: str(data, encoding="latin2")
+
+For invalid UTF-8 or arbitrary data stored in ``TEXT`` table columns,
+you can use the following technique, borrowed from the :ref:`unicode-howto`:
+
+.. testcode::
+
+   con.text_factory = lambda data: str(data, errors="surrogateescape")
+
+.. note::
+
+   The :mod:`!sqlite3` module API does not support strings
+   containing surrogates.
+
+.. seealso::
+
+   :ref:`unicode-howto`
+
+
 .. _sqlite3-explanation:

 Explanation
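
The Latin-2 recipe added by this commit assumes an existing connection. A minimal self-contained sketch of the same idea follows; it is an illustration, not part of the commit. The in-memory database, the `dictionary` table, and the sample word are all assumptions made for the example, and `CAST(? AS TEXT)` is used only as a convenient way to get non-UTF-8 bytes into a `TEXT` column.

```python
import sqlite3

# Hypothetical in-memory stand-in for the Czech-English dictionary database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dictionary(czech TEXT, english TEXT)")

# Store Latin-2 bytes in a TEXT column: CAST(? AS TEXT) relabels the
# blob's bytes as text without checking that they are valid UTF-8.
czech_word = "žluťoučký"
con.execute(
    "INSERT INTO dictionary VALUES (CAST(? AS TEXT), ?)",
    (czech_word.encode("latin2"), "yellowish"),
)

# With the default text_factory (str), these bytes are not valid UTF-8,
# so reading the row is expected to fail.
try:
    con.execute("SELECT czech, english FROM dictionary").fetchone()
except sqlite3.OperationalError as exc:
    print("default text_factory failed:", exc)

# Decode every TEXT value as Latin-2 instead, as in the guide.
con.text_factory = lambda data: str(data, encoding="latin2")
row = con.execute("SELECT czech, english FROM dictionary").fetchone()
print(row)

con.close()
```

Note that the factory applies to every `TEXT` value on the connection, so this approach fits best when the whole database uses one encoding.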

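The `surrogateescape` recipe can be sketched the same way. This is a hedged illustration rather than part of the commit: the sample bytes are an assumption chosen because they are not valid UTF-8, and the round trip back to bytes uses only standard `str.encode` behavior.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.text_factory = lambda data: str(data, errors="surrogateescape")

# b"caf\xe9" is Latin-1 for "café" and is not valid UTF-8.
raw = b"caf\xe9"
text = con.execute("SELECT CAST(? AS TEXT)", (raw,)).fetchone()[0]

# The undecodable byte 0xE9 is preserved as the lone surrogate U+DCE9.
print(ascii(text))

# Encoding with the same error handler restores the original bytes.
recovered = text.encode("utf-8", errors="surrogateescape")
print(recovered == raw)

con.close()
```

Because the module API does not support strings containing surrogates, a value decoded this way should be converted back to `bytes` as above before it is passed to SQLite again as a parameter.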