Commit 1c33a28

Make charset auto-detection optional. (#2165)
* Add Response(..., default_encoding=...)
* Add tests for Response(..., default_encoding=...)
* Add Client(..., default_encoding=...)
* Switch default encoding to 'utf-8' instead of 'autodetect'
* Make charset_normalizer an optional dependency, not a mandatory one.
* Documentation
* Use callable for default_encoding
* Update tests for new charset autodetection API
* Update docs for new charset autodetection API
* Update requirements
* Drop charset_normalizer from requirements
1 parent 940d61b commit 1c33a28

11 files changed: +245 -54 lines changed

README.md

Lines changed: 0 additions & 1 deletion
@@ -128,7 +128,6 @@ The HTTPX project relies on these excellent libraries:
 * `httpcore` - The underlying transport implementation for `httpx`.
 * `h11` - HTTP/1.1 support.
 * `certifi` - SSL certificates.
-* `charset_normalizer` - Charset auto-detection.
 * `rfc3986` - URL parsing & normalization.
 * `idna` - Internationalized domain name support.
 * `sniffio` - Async library autodetection.

README_chinese.md

Lines changed: 0 additions & 1 deletion
@@ -129,7 +129,6 @@ HTTPX项目依赖于这些优秀的库:
 * `h11` - HTTP/1.1 support.
 * `h2` - HTTP/2 support. *(Optional, with `httpx[http2]`)*
 * `certifi` - SSL certificates.
-* `charset_normalizer` - Charset auto-detection.
 * `rfc3986` - URL parsing & normalization.
 * `idna` - Internationalized domain name support.
 * `sniffio` - Async library autodetection.

docs/advanced.md

Lines changed: 82 additions & 0 deletions
@@ -145,6 +145,88 @@ URL('http://httpbin.org/headers')
 
 For a list of all available client parameters, see the [`Client`](api.md#client) API reference.
 
+---
+
+## Character set encodings and auto-detection
+
+When accessing `response.text`, we need to decode the response bytes into a unicode text representation.
+
+By default `httpx` will use `"charset"` information included in the response `Content-Type` header to determine how the response bytes should be decoded into text.
+
+In cases where no charset information is included on the response, the default behaviour is to assume "utf-8" encoding, which is by far the most widely used text encoding on the internet.
+
+### Using the default encoding
+
+To understand this better let's start by looking at the default behaviour for text decoding...
+
+```python
+import httpx
+# Instantiate a client with the default configuration.
+client = httpx.Client()
+# Using the client...
+response = client.get(...)
+print(response.encoding)  # This will either print the charset given in
+                          # the Content-Type charset, or else "utf-8".
+print(response.text)  # The text will either be decoded with the Content-Type
+                      # charset, or using "utf-8".
+```
+
+This is normally absolutely fine. Most servers will respond with a properly formatted Content-Type header, including a charset encoding. And in most cases where no charset encoding is included, UTF-8 is very likely to be used, since it is so widely adopted.
+
+### Using an explicit encoding
+
+In some cases we might be making requests to a site where no character set information is being set explicitly by the server, but we know what the encoding is. In this case it's best to set the default encoding explicitly on the client.
+
+```python
+import httpx
+# Instantiate a client with a Japanese character set as the default encoding.
+client = httpx.Client(default_encoding="shift-jis")
+# Using the client...
+response = client.get(...)
+print(response.encoding)  # This will either print the charset given in
+                          # the Content-Type charset, or else "shift-jis".
+print(response.text)  # The text will either be decoded with the Content-Type
+                      # charset, or using "shift-jis".
+```
+
+### Using character set auto-detection
+
+In cases where the server is not reliably including character set information, and where we don't know what encoding is being used, we can enable auto-detection to make a best-guess attempt when decoding from bytes to text.
+
+To use auto-detection you need to set the `default_encoding` argument to a callable instead of a string. This callable should be a function which takes the input bytes as an argument and returns the character set to use for decoding those bytes to text.
+
+There are two widely used Python packages which both handle this functionality:
+
+* [`chardet`](https://chardet.readthedocs.io/) - This is a well established package, and is a port of [the auto-detection code in Mozilla](https://www-archive.mozilla.org/projects/intl/chardet.html).
+* [`charset-normalizer`](https://charset-normalizer.readthedocs.io/) - A newer package, motivated by `chardet`, with a different approach.
+
+Let's take a look at installing autodetection using one of these packages...
+
+```shell
+$ pip install httpx
+$ pip install chardet
+```
+
+Once `chardet` is installed, we can configure a client to use character-set autodetection.
+
+```python
+import httpx
+import chardet
+
+def autodetect(content):
+    return chardet.detect(content).get("encoding")
+
+# Using a client with character-set autodetection enabled.
+client = httpx.Client(default_encoding=autodetect)
+response = client.get(...)
+print(response.encoding)  # This will either print the charset given in
+                          # the Content-Type charset, or else the auto-detected
+                          # character set.
+print(response.text)
+```
+
+---
+
 ## Calling into Python Web Apps
 
 You can configure an `httpx` client to call directly into a Python web application using the WSGI protocol.
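
The documentation added above says that `default_encoding` may be any callable that takes the response bytes and returns an encoding name. As a rough illustration of that contract, here is a minimal stdlib-only sketch. The function name `detect_by_trial` and its candidate list are assumptions made for this example and are not part of `httpx`, `chardet`, or `charset-normalizer`. It is a trial-decoding heuristic, not real statistical detection.

```python
import typing


def detect_by_trial(
    content: bytes,
    candidates: typing.Sequence[str] = ("utf-8", "cp1252"),
    fallback: str = "latin-1",
) -> str:
    """Return the first candidate encoding that decodes `content` without error.

    Hypothetical helper: the candidate order and the fallback are illustrative
    choices, not behaviour taken from httpx.
    """
    for name in candidates:
        try:
            content.decode(name)
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
        return name
    # "latin-1" maps every byte value, so decoding with it never fails.
    return fallback
```

Because the extra parameters have defaults, the function can be called with the bytes alone, which is the shape the documentation describes for `httpx.Client(default_encoding=...)`. A dedicated detection package is likely to be a better choice whenever the set of plausible encodings is not known in advance.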

docs/index.md

Lines changed: 0 additions & 1 deletion
@@ -109,7 +109,6 @@ The HTTPX project relies on these excellent libraries:
 * `httpcore` - The underlying transport implementation for `httpx`.
 * `h11` - HTTP/1.1 support.
 * `certifi` - SSL certificates.
-* `charset_normalizer` - Charset auto-detection.
 * `rfc3986` - URL parsing & normalization.
 * `idna` - Internationalized domain name support.
 * `sniffio` - Async library autodetection.

httpx/_client.py

Lines changed: 14 additions & 0 deletions
@@ -168,6 +168,7 @@ def __init__(
         ] = None,
         base_url: URLTypes = "",
         trust_env: bool = True,
+        default_encoding: typing.Union[str, typing.Callable[[bytes], str]] = "utf-8",
     ):
         event_hooks = {} if event_hooks is None else event_hooks
 
@@ -185,6 +186,7 @@ def __init__(
             "response": list(event_hooks.get("response", [])),
         }
         self._trust_env = trust_env
+        self._default_encoding = default_encoding
         self._netrc = NetRCInfo()
         self._state = ClientState.UNOPENED
 
@@ -611,6 +613,9 @@ class Client(BaseClient):
     rather than sending actual network requests.
     * **trust_env** - *(optional)* Enables or disables usage of environment
     variables for configuration.
+    * **default_encoding** - *(optional)* The default encoding to use for decoding
+    response text, if no charset information is included in a response Content-Type
+    header. Set to a callable for automatic character set detection. Default: "utf-8".
     """
 
     def __init__(
@@ -637,6 +642,7 @@ def __init__(
         transport: typing.Optional[BaseTransport] = None,
         app: typing.Optional[typing.Callable] = None,
         trust_env: bool = True,
+        default_encoding: typing.Union[str, typing.Callable[[bytes], str]] = "utf-8",
     ):
         super().__init__(
             auth=auth,
@@ -649,6 +655,7 @@ def __init__(
             event_hooks=event_hooks,
             base_url=base_url,
             trust_env=trust_env,
+            default_encoding=default_encoding,
         )
 
         if http2:
@@ -1002,6 +1009,7 @@ def _send_single_request(self, request: Request) -> Response:
             response.stream, response=response, timer=timer
         )
         self.cookies.extract_cookies(response)
+        response.default_encoding = self._default_encoding
 
         status = f"{response.status_code} {response.reason_phrase}"
         response_line = f"{response.http_version} {status}"
@@ -1326,6 +1334,9 @@ class AsyncClient(BaseClient):
     rather than sending actual network requests.
     * **trust_env** - *(optional)* Enables or disables usage of environment
     variables for configuration.
+    * **default_encoding** - *(optional)* The default encoding to use for decoding
+    response text, if no charset information is included in a response Content-Type
+    header. Set to a callable for automatic character set detection. Default: "utf-8".
     """
 
     def __init__(
@@ -1352,6 +1363,7 @@ def __init__(
         transport: typing.Optional[AsyncBaseTransport] = None,
         app: typing.Optional[typing.Callable] = None,
         trust_env: bool = True,
+        default_encoding: typing.Union[str, typing.Callable[[bytes], str]] = "utf-8",
     ):
         super().__init__(
             auth=auth,
@@ -1364,6 +1376,7 @@ def __init__(
             event_hooks=event_hooks,
             base_url=base_url,
             trust_env=trust_env,
+            default_encoding=default_encoding,
         )
 
         if http2:
@@ -1708,6 +1721,7 @@ async def _send_single_request(self, request: Request) -> Response:
             response.stream, response=response, timer=timer
         )
         self.cookies.extract_cookies(response)
+        response.default_encoding = self._default_encoding
 
         status = f"{response.status_code} {response.reason_phrase}"
         response_line = f"{response.http_version} {status}"

httpx/_models.py

Lines changed: 11 additions & 19 deletions
@@ -7,8 +7,6 @@
 from collections.abc import MutableMapping
 from http.cookiejar import Cookie, CookieJar
 
-import charset_normalizer
-
 from ._content import ByteStream, UnattachedStream, encode_request, encode_response
 from ._decoders import (
     SUPPORTED_DECODERS,
@@ -445,6 +443,7 @@ def __init__(
         request: typing.Optional[Request] = None,
         extensions: typing.Optional[dict] = None,
         history: typing.Optional[typing.List["Response"]] = None,
+        default_encoding: typing.Union[str, typing.Callable[[bytes], str]] = "utf-8",
     ):
         self.status_code = status_code
         self.headers = Headers(headers)
@@ -461,6 +460,8 @@ def __init__(
         self.is_closed = False
         self.is_stream_consumed = False
 
+        self.default_encoding = default_encoding
+
         if stream is None:
             headers, stream = encode_response(content, text, html, json)
             self._prepare(headers)
@@ -569,14 +570,18 @@ def encoding(self) -> typing.Optional[str]:
 
         * `.encoding = <>` has been set explicitly.
         * The encoding as specified by the charset parameter in the Content-Type header.
-        * The encoding as determined by `charset_normalizer`.
-        * UTF-8.
+        * The encoding as determined by `default_encoding`, which may either be
+          a string like "utf-8" indicating the encoding to use, or may be a callable
+          which enables charset autodetection.
         """
         if not hasattr(self, "_encoding"):
             encoding = self.charset_encoding
             if encoding is None or not is_known_encoding(encoding):
-                encoding = self.apparent_encoding
-            self._encoding = encoding
+                if isinstance(self.default_encoding, str):
+                    encoding = self.default_encoding
+                elif hasattr(self, "_content"):
+                    encoding = self.default_encoding(self._content)
+            self._encoding = encoding or "utf-8"
         return self._encoding
 
     @encoding.setter
@@ -598,19 +603,6 @@ def charset_encoding(self) -> typing.Optional[str]:
 
         return params["charset"].strip("'\"")
 
-    @property
-    def apparent_encoding(self) -> typing.Optional[str]:
-        """
-        Return the encoding, as determined by `charset_normalizer`.
-        """
-        content = getattr(self, "_content", b"")
-        if len(content) < 32:
-            # charset_normalizer will issue warnings if we run it with
-            # fewer bytes than this cutoff.
-            return None
-        match = charset_normalizer.from_bytes(self.content).best()
-        return None if match is None else match.encoding
-
     def _get_content_decoder(self) -> ContentDecoder:
         """
         Returns a decoder instance which can be used to decode the raw byte
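
The lookup order that the reworked `Response.encoding` property follows can be easier to see outside the class. The sketch below restates the branch logic from the hunk above as a free function. `resolve_encoding` and its parameters are hypothetical names for illustration, and the local `is_known_encoding` is a stand-in assumed to behave like the helper the diff calls (a codec lookup), which this diff does not show.

```python
import codecs
import typing

# Assumed shape of `default_encoding`: a codec name, or a callable over the bytes.
DefaultEncoding = typing.Union[str, typing.Callable[[bytes], typing.Optional[str]]]


def is_known_encoding(name: str) -> bool:
    """Stand-in check: True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def resolve_encoding(
    charset: typing.Optional[str],
    content: typing.Optional[bytes],
    default_encoding: DefaultEncoding = "utf-8",
) -> str:
    """Hypothetical free-function version of the branch logic in the diff."""
    encoding = charset
    if encoding is None or not is_known_encoding(encoding):
        if isinstance(default_encoding, str):
            # A plain string is used directly as the fallback codec name.
            encoding = default_encoding
        elif content is not None:
            # A callable is only consulted once the body bytes are available.
            encoding = default_encoding(content)
    # If nothing produced a name (e.g. the detector returned None), use UTF-8.
    return encoding or "utf-8"
```

For example, `resolve_encoding(None, b"abc", "shift-jis")` returns `"shift-jis"`, while a detector that returns `None` falls through to `"utf-8"`.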

requirements.txt

Lines changed: 4 additions & 1 deletion
@@ -4,7 +4,10 @@
 # Reference: https://github.com/encode/httpx/pull/1721#discussion_r661241588
 -e .[brotli,cli,http2,socks]
 
-charset-normalizer==2.0.6
+# Optional charset auto-detection
+# Used in our test cases
+chardet==4.0.0
+types-chardet==4.0.4
 
 # Documentation
 mkdocs==1.3.0

setup.py

Lines changed: 0 additions & 1 deletion
@@ -57,7 +57,6 @@ def get_packages(package):
     zip_safe=False,
     install_requires=[
         "certifi",
-        "charset_normalizer",
         "sniffio",
         "rfc3986[idna2008]>=1.3,<2",
         "httpcore>=0.15.0,<0.16.0",

tests/client/test_client.py

Lines changed: 61 additions & 1 deletion
@@ -1,11 +1,16 @@
 import typing
 from datetime import timedelta
 
+import chardet
 import pytest
 
 import httpx
 
 
+def autodetect(content):
+    return chardet.detect(content).get("encoding")
+
+
 def test_get(server):
     url = server.url
     with httpx.Client(http2=True) as http:
@@ -15,7 +20,7 @@ def test_get(server):
     assert response.content == b"Hello, world!"
     assert response.text == "Hello, world!"
     assert response.http_version == "HTTP/1.1"
-    assert response.encoding is None
+    assert response.encoding == "utf-8"
     assert response.request.url == url
     assert response.headers
     assert response.is_redirect is False
@@ -398,3 +403,58 @@ def test_server_extensions(server):
         response = client.get(url)
     assert response.status_code == 200
     assert response.extensions["http_version"] == b"HTTP/1.1"
+
+
+def test_client_decode_text_using_autodetect():
+    # Ensure that a 'default_encoding=autodetect' on the response allows for
+    # encoding autodetection to be used when no "Content-Type: text/plain; charset=..."
+    # info is present.
+    #
+    # Here we have some French text encoded with ISO-8859-1, rather than UTF-8.
+    text = (
+        "Non-seulement Despréaux ne se trompait pas, mais de tous les écrivains "
+        "que la France a produits, sans excepter Voltaire lui-même, imprégné de "
+        "l'esprit anglais par son séjour à Londres, c'est incontestablement "
+        "Molière ou Poquelin qui reproduit avec l'exactitude la plus vive et la "
+        "plus complète le fond du génie français."
+    )
+
+    def cp1252_but_no_content_type(request):
+        content = text.encode("ISO-8859-1")
+        return httpx.Response(200, content=content)
+
+    transport = httpx.MockTransport(cp1252_but_no_content_type)
+    with httpx.Client(transport=transport, default_encoding=autodetect) as client:
+        response = client.get("http://www.example.com")
+
+        assert response.status_code == 200
+        assert response.reason_phrase == "OK"
+        assert response.encoding == "ISO-8859-1"
+        assert response.text == text
+
+
+def test_client_decode_text_using_explicit_encoding():
+    # Ensure that a 'default_encoding="..."' on the response is used for text decoding
+    # when no "Content-Type: text/plain; charset=..." info is present.
+    #
+    # Here we have some French text encoded with ISO-8859-1, rather than UTF-8.
+    text = (
+        "Non-seulement Despréaux ne se trompait pas, mais de tous les écrivains "
+        "que la France a produits, sans excepter Voltaire lui-même, imprégné de "
+        "l'esprit anglais par son séjour à Londres, c'est incontestablement "
+        "Molière ou Poquelin qui reproduit avec l'exactitude la plus vive et la "
+        "plus complète le fond du génie français."
+    )
+
+    def cp1252_but_no_content_type(request):
+        content = text.encode("ISO-8859-1")
+        return httpx.Response(200, content=content)
+
+    transport = httpx.MockTransport(cp1252_but_no_content_type)
+    with httpx.Client(transport=transport, default_encoding="ISO-8859-1") as client:
+        response = client.get("http://www.example.com")
+
+        assert response.status_code == 200
+        assert response.reason_phrase == "OK"
+        assert response.encoding == "ISO-8859-1"
+        assert response.text == text
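
Both tests above depend on ISO-8859-1 bytes not being valid UTF-8, which is why the new "utf-8" default alone would not reproduce the original text. The following stdlib-only sketch, using one fragment of the test string, shows that mismatch in isolation. It says nothing about how `httpx` itself handles decode errors; the variable names are illustrative only.

```python
text = "plus complète le fond du génie français."
raw = text.encode("ISO-8859-1")

# Decoding with the codec that produced the bytes round-trips cleanly.
assert raw.decode("ISO-8859-1") == text

# Strict UTF-8 decoding rejects the bytes: each accented letter is a single
# byte >= 0x80 that does not form a valid UTF-8 sequence here.
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    strict_utf8_ok = False
else:
    strict_utf8_ok = True

# Lenient decoding succeeds but substitutes U+FFFD for the undecodable bytes,
# so the accented characters are lost.
lossy = raw.decode("utf-8", errors="replace")
```

This is the situation the `default_encoding` argument is meant to address: either name the correct codec explicitly, or supply a callable that can guess it from the bytes.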
