Commit 6ec6014
committed
iteration: fall back to UTF-8 (not windows-1252) if encoding uncertain
As discussed in #309: Since #184 we have been using DetermineEncoding to
deal with the case of UTF-16 files.
That was a reasonable fix for exercism/exercism#2303.
DetermineEncoding only looks at the first 1024 bytes of a file. If it
can't determine an encoding, it defaults to windows-1252.
This causes undesirable behaviour for files with Unicode characters but
also only ASCII in their first 1024 characters - they get interpreted as
windows-1252, mangling the Unicode characters.
This commit takes advantage of the fact that DetermineEncoding reports
whether it is *certain* about its encoding guess. If it is uncertain, we
default to UTF-8 instead of windows-1252.
Note that if DetermineEncoding sees UTF-16 BOMs, it will declare that it
is certain. Therefore, behaviour for UTF-16 files is preserved (existing
tests would have caught it if behaviour were accidentally altered).
A new fixture file is attached that tests this case - the test fails
without the attached code change.
I find it unlikely that DetermineEncoding would have returned anything
other than UTF-16, UTF-8, or windows-1252 since it was made to examine
HTML documents and thus examine the content-type (we always pass
text/plain) and the meta tags (unlikely to be present in a non-HTML
Exercism submission).
The risk of this change is that anyone who **actually** wanted to submit
in windows-1252 will now be unable to, but I doubt that anyone is in
this constituency, and discussion in #309 seems to be in favor of
nudging them toward UTF-8 anyway.
Closes #3091 parent 217aaea commit 6ec6014
3 files changed
Lines changed: 45 additions & 3 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
137 | 137 | | |
138 | 138 | | |
139 | 139 | | |
140 | | - | |
| 140 | + | |
| 141 | + | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
141 | 150 | | |
142 | 151 | | |
143 | 152 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
18 | 18 | | |
19 | 19 | | |
20 | 20 | | |
| 21 | + | |
21 | 22 | | |
22 | 23 | | |
23 | 24 | | |
| |||
32 | 33 | | |
33 | 34 | | |
34 | 35 | | |
35 | | - | |
36 | | - | |
| 36 | + | |
| 37 | + | |
37 | 38 | | |
38 | 39 | | |
39 | 40 | | |
| |||
45 | 46 | | |
46 | 47 | | |
47 | 48 | | |
| 49 | + | |
48 | 50 | | |
49 | 51 | | |
50 | 52 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
0 commit comments