iteration: fall back to UTF-8 (not windows-1252) if encoding uncertain #346
Merged
kytrinyx merged 3 commits into exercism:master from petertseng:encoding on Sep 26, 2016
The code expected five elements, but the comment claimed that three was the correct number.
var buf bytes.Buffer
buf.WriteString("# ")
for i := 0; i < 1024; i++ {
nah, let's use strings.Repeat instead.
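For readers following along, here is a minimal sketch of what that swap could look like. The loop body is not shown in the diff excerpt above, so the filler byte `'a'` and both function names are assumptions made purely for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// asciiPrefixLoop builds "# " followed by n filler bytes using an explicit loop.
// The filler byte is an assumption; the original loop body is not visible here.
func asciiPrefixLoop(n int) string {
	var buf bytes.Buffer
	buf.WriteString("# ")
	for i := 0; i < n; i++ {
		buf.WriteByte('a')
	}
	return buf.String()
}

// asciiPrefixRepeat builds the same string with strings.Repeat.
func asciiPrefixRepeat(n int) string {
	return "# " + strings.Repeat("a", n)
}

func main() {
	// Both helpers should produce identical output.
	fmt.Println(asciiPrefixLoop(1024) == asciiPrefixRepeat(1024)) // true
}
```

The `strings.Repeat` form drops the buffer and the loop counter, which is presumably why it was preferred for a test fixture helper.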
An upcoming commit will soon add a test file for which the suffix is important, in addition to the prefix. The middle of the file will contain long text that is not desirable to test. Thus, testing suffixes is helpful. Note that all current tests have no suffixes, so all cases should pass.
As discussed in #309: Since #184 we have been using DetermineEncoding to deal with the case of UTF-16 files. That was a reasonable fix for exercism/exercism#2303.

DetermineEncoding only looks at the first 1024 bytes of a file. If it can't determine an encoding, it defaults to windows-1252. This causes undesirable behaviour for files with Unicode characters but also only ASCII in their first 1024 characters - they get interpreted as windows-1252, mangling the Unicode characters.

This commit takes advantage of the fact that DetermineEncoding reports whether it is *certain* about its encoding guess. If it is uncertain, we default to UTF-8 instead of windows-1252.

Note that if DetermineEncoding sees UTF-16 BOMs, it will declare that it is certain. Therefore, behaviour for UTF-16 files is preserved (existing tests would have caught it if behaviour were accidentally altered).

A new fixture file is attached that tests this case - the test fails without the attached code change.

I find it unlikely that DetermineEncoding would have returned anything other than UTF-16, UTF-8, or windows-1252 since it was made to examine HTML documents and thus examine the content-type (we always pass text/plain) and the meta tags (unlikely to be present in a non-HTML Exercism submission).

The risk of this change is that anyone who **actually** wanted to submit in windows-1252 will now be unable to, but I doubt that anyone is in this constituency, and discussion in #309 seems to be in favor of nudging them toward UTF-8 anyway.

Closes #309
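The decision described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code merged in this PR: the `guess` type and `chooseEncoding` are hypothetical stand-ins for the name and certainty flag that DetermineEncoding reports, and `decodeAsSingleByte` only approximates windows-1252 (it matches it for bytes 0xA0 to 0xFF, which is enough to show the mangling).

```go
package main

import "fmt"

// guess is a hypothetical stand-in for what the detector reports:
// an encoding name plus whether it was certain about it.
type guess struct {
	name    string
	certain bool
}

// chooseEncoding keeps a certain guess (for example UTF-16 detected via a BOM)
// and otherwise falls back to UTF-8 instead of the detector's default.
func chooseEncoding(g guess) string {
	if g.certain {
		return g.name
	}
	return "utf-8"
}

// decodeAsSingleByte maps every byte to the code point with the same value.
// Assumption: this stands in for a windows-1252 decoder; the two agree for
// bytes 0xA0-0xFF but differ in the 0x80-0x9F range.
func decodeAsSingleByte(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	fmt.Println(chooseEncoding(guess{"utf-16le", true}))      // utf-16le
	fmt.Println(chooseEncoding(guess{"windows-1252", false})) // utf-8

	// UTF-8 bytes read as a single-byte encoding: the mangling this PR avoids.
	fmt.Println(decodeAsSingleByte([]byte("café"))) // cafÃ©
}
```

In the real code the guess would come from DetermineEncoding called with the file's bytes and a text/plain content type; only the uncertain branch changes behaviour, which is why the UTF-16 cases are unaffected.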
Contributor
Good explanation of the changes and the test coverage looks good too. Code looks good. The tests pass on my OSX(10.12) machine. Thanks @petertseng for this PR.
Member
I'd say let's go for it.
Member
Going for it :)