Files that have Unicode chars only after the 1024th byte get their Unicode mangled #309

@petertseng

I had a file whose first 1024 bytes contained no non-ASCII (Unicode) characters, but which did contain one after the 1024th byte. An example of such a file is:

http://exercism.io/submissions/1e341848768141cf8eba94c6af6e55a7

Submitting this file mangles the Unicode character.

In contrast, this file has a Unicode character within the first 1024 bytes, so it is submitted correctly (even the Unicode that appears after the first 1024 bytes comes through intact):

http://exercism.io/submissions/dce4e3ddf0294034ad987ce7b86cdb38

(These are just example submissions in Hello World, but this also affected my submission for a real exercise, Counter in xgo.)

I tracked this down to readFileAsUTF8String in api/iteration.go. It uses the https://godoc.org/golang.org/x/net/html/charset#DetermineEncoding function to determine the encoding, and that function inspects only the first 1024 bytes of the content.

I'm not really sure what the right solution is here. I know that function was created for #182 to solve exercism/exercism#2303, so there is obviously a legitimate reason behind all this. I guess we now just need to figure out how to handle this case as well. I don't yet have a good solution, so I'll file this first and sleep on it for a bit.
