Skip to content

iteration: fall back to UTF-8 (not windows-1252) if encoding uncertain#346

Merged
kytrinyx merged 3 commits intoexercism:masterfrom
petertseng:encoding
Sep 26, 2016
Merged

iteration: fall back to UTF-8 (not windows-1252) if encoding uncertain#346
kytrinyx merged 3 commits intoexercism:masterfrom
petertseng:encoding

Conversation

@petertseng
Copy link
Copy Markdown
Member

@petertseng petertseng commented Sep 5, 2016

As discussed in #309: Since #184 we have been using DetermineEncoding to
deal with the case of UTF-16 files.
That was a reasonable fix for exercism/exercism#2303.

DetermineEncoding only looks at the first 1024 bytes of a file. If it
can't determine an encoding, it defaults to windows-1252.

This causes undesirable behaviour for files with Unicode characters but
also only ASCII in their first 1024 characters - they get interpreted as
windows-1252, mangling the Unicode characters.

This commit takes advantage of the fact that DetermineEncoding reports
whether it is certain about its encoding guess. If it is uncertain, we
default to UTF-8 instead of windows-1252.

Note that if DetermineEncoding sees UTF-16 BOMs, it will declare that it
is certain. Therefore, behaviour for UTF-16 files is preserved (existing
tests would have caught it if behaviour were accidentally altered).

A new fixture file is attached that tests this case - the test fails
without the attached code change.

I find it unlikely that DetermineEncoding would have returned anything
other than UTF-16, UTF-8, or windows-1252 since it was made to examine
HTML documents and thus examine the content-type (we always pass
text/plain) and the meta tags (unlikely to be present in a non-HTML
Exercism submission).

The risk of this change is that anyone who actually wanted to submit
in windows-1252 will now be unable to, but I doubt that anyone is in
this constituency, and discussion in #309 seems to be in favor of
nudging them toward UTF-8 anyway.

Closes #309

Code was expecting five elements, but comment claimed that three
elements was the correct number.
Comment thread api/iteration_test.go Outdated

var buf bytes.Buffer
buf.WriteString("# ")
for i := 0; i < 1024; i++ {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nah, let's use strings.Repeat instead.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

An upcoming commit will soon add a test file for which the suffix is
important, in addition to the prefix. The middle of the file will
contain long text that is not desirable to test. Thus, testing suffixes
is helpful.

Note that all current tests have no suffixes, so all cases should pass.
As discussed in #309: Since #184 we have been using DetermineEncoding to
deal with the case of UTF-16 files.
That was a reasonable fix for exercism/exercism#2303.

DetermineEncoding only looks at the first 1024 bytes of a file. If it
can't determine an encoding, it defaults to windows-1252.

This causes undesirable behaviour for files with Unicode characters but
also only ASCII in their first 1024 characters - they get interpreted as
windows-1252, mangling the Unicode characters.

This commit takes advantage of the fact that DetermineEncoding reports
whether it is *certain* about its encoding guess. If it is uncertain, we
default to UTF-8 instead of windows-1252.

Note that if DetermineEncoding sees UTF-16 BOMs, it will declare that it
is certain. Therefore, behaviour for UTF-16 files is preserved (existing
tests would have caught it if behaviour were accidentally altered).

A new fixture file is attached that tests this case - the test fails
without the attached code change.

I find it unlikely that DetermineEncoding would have returned anything
other than UTF-16, UTF-8, or windows-1252 since it was made to examine
HTML documents and thus examine the content-type (we always pass
text/plain) and the meta tags (unlikely to be present in a non-HTML
Exercism submission).

The risk of this change is that anyone who **actually** wanted to submit
in windows-1252 will now be unable to, but I doubt that anyone is in
this constituency, and discussion in #309 seems to be in favor of
nudging them toward UTF-8 anyway.

Closes #309
@Tonkpils
Copy link
Copy Markdown
Contributor

These changes seem ok to me though I'd be more comfortable if @lcowell or @kytrinyx double checked on these. The test makes me comfortable enough to merge this though I'd feel more confident with a second pair of eyes on this.

@lcowell
Copy link
Copy Markdown
Contributor

lcowell commented Sep 20, 2016

Good explanation of the changes and the test coverage looks good too. Code looks good. The tests pass on my OSX(10.12) machine. Thanks @petertseng for this PR.

@kytrinyx
Copy link
Copy Markdown
Member

I'd say let's go for it.

@kytrinyx
Copy link
Copy Markdown
Member

Going for it :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Files that have Unicode chars only after the 1024th byte get their Unicode mangled

4 participants