Skip to content

Commit 6ec6014

Browse files
committed
iteration: fall back to UTF-8 (not windows-1252) if encoding uncertain
As discussed in #309: Since #184 we have been using DetermineEncoding to deal with the case of UTF-16 files. That was a reasonable fix for exercism/exercism#2303. DetermineEncoding only looks at the first 1024 bytes of a file. If it can't determine an encoding, it defaults to windows-1252. This causes undesirable behaviour for files with Unicode characters but also only ASCII in their first 1024 characters - they get interpreted as windows-1252, mangling the Unicode characters. This commit takes advantage of the fact that DetermineEncoding reports whether it is *certain* about its encoding guess. If it is uncertain, we default to UTF-8 instead of windows-1252. Note that if DetermineEncoding sees UTF-16 BOMs, it will declare that it is certain. Therefore, behaviour for UTF-16 files is preserved (existing tests would have caught it if behaviour were accidentally altered). A new fixture file is attached that tests this case - the test fails without the attached code change. I find it unlikely that DetermineEncoding would have returned anything other than UTF-16, UTF-8, or windows-1252 since it was made to examine HTML documents and thus examine the content-type (we always pass text/plain) and the meta tags (unlikely to be present in a non-HTML Exercism submission). The risk of this change is that anyone who **actually** wanted to submit in windows-1252 will now be unable to, but I doubt that anyone is in this constituency, and discussion in #309 seems to be in favor of nudging them toward UTF-8 anyway. Closes #309
1 parent 217aaea commit 6ec6014

3 files changed

Lines changed: 45 additions & 3 deletions

File tree

api/iteration.go

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -137,7 +137,16 @@ func readFileAsUTF8String(filename string) (*string, error) {
137137
return nil, err
138138
}
139139

140-
encoding, _, _ := charset.DetermineEncoding(b, mimeType)
140+
encoding, _, certain := charset.DetermineEncoding(b, mimeType)
141+
if !certain {
142+
// We don't want to use an uncertain encoding.
143+
// In particular, doing that may mangle UTF-8 files
144+
// that have only ASCII in their first 1024 bytes.
145+
// See https://github.com/exercism/cli/issues/309.
146+
// So if we're unsure, use UTF-8 (no transformation).
147+
s := string(b)
148+
return &s, nil
149+
}
141150
decoder := encoding.NewDecoder()
142151
decodedBytes, _, err := transform.Bytes(decoder, b)
143152
if err != nil {

api/iteration_test.go

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,7 @@ func TestNewIteration(t *testing.T) {
1818
filepath.Join(dir, "python", "leap", "lib", "three.py"),
1919
filepath.Join(dir, "python", "leap", "utf16le.py"),
2020
filepath.Join(dir, "python", "leap", "utf16be.py"),
21+
filepath.Join(dir, "python", "leap", "long-utf8.py"),
2122
}
2223

2324
iter, err := NewIteration(dir, files)
@@ -32,8 +33,8 @@ func TestNewIteration(t *testing.T) {
3233
t.Errorf("Expected problem to be leap, was %s", iter.Problem)
3334
}
3435

35-
if len(iter.Solution) != 5 {
36-
t.Fatalf("Expected solution to have 5 files, had %d", len(iter.Solution))
36+
if len(iter.Solution) != 6 {
37+
t.Fatalf("Expected solution to have 6 files, had %d", len(iter.Solution))
3738
}
3839

3940
expected := map[string]struct {
@@ -45,6 +46,7 @@ func TestNewIteration(t *testing.T) {
4546
filepath.Join("lib", "three.py"): {prefix: "# three"},
4647
"utf16le.py": {prefix: "# utf16le"},
4748
"utf16be.py": {prefix: "# utf16be"},
49+
"long-utf8.py": {prefix: "# The first 1024", suffix: "👍\n"},
4850
}
4951

5052
for filename, code := range expected {
Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
# The first 1024 bytes of this file need to contain only ASCII characters.
2+
# After the first 1024 bytes, then there should be a non-ASCII character.
3+
#
4+
# Explanation:
5+
# We use golang.org/x/net/html/charset.DetectEncoding to guess file encoding.
6+
# DetectEncoding checks the first 1024 bytes of a file.
7+
# If it can't determine the encoding and saw no non-ASCII characters,
8+
# it declares the file to have windows-1252 encoding.
9+
# This mangles the submitted file if it should have been UTF-8.
10+
# We test to make sure we use UTF-8 for such files, instead of windows-1252.
11+
12+
lipsum = """
13+
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam condimentum vitae
14+
ipsum eget tempor. Morbi sed ex quis orci vulputate cursus quis non massa.
15+
Vestibulum quam nibh, elementum in justo in, venenatis tristique nisl. Morbi
16+
sagittis elit id velit ultricies, sed rutrum augue posuere. Donec nec nulla nec
17+
eros fringilla pellentesque. Duis at dictum justo. Nunc ut magna felis. Aliquam
18+
volutpat, lectus et molestie porttitor, est orci malesuada erat, ac pretium
19+
eros ligula vel erat. Nullam venenatis dui eget sapien semper lobortis. Aenean
20+
ac eros eget neque porta auctor in nec erat. Phasellus ac nulla ac turpis
21+
porttitor auctor. Etiam eget posuere diam, ac feugiat lacus. Curabitur ornare
22+
justo ut nulla congue, vitae posuere erat venenatis. Aliquam pulvinar eleifend
23+
faucibus.
24+
25+
Etiam justo sem, faucibus malesuada purus a, ultrices efficitur ex.
26+
Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac
27+
turpis egestas. Duis maximus dapibus mattis. Quisque sem ex, convallis eu
28+
ultricies posuere.
29+
"""
30+
31+
# 👍

0 commit comments

Comments
 (0)