Continuing from #3401, it's clear that the way node.js handles path name encodings is sub-optimal. What is not clear is how to fix it. This issue is for discussing possible solutions.
A quick recap of the current situation:
- node.js assumes UTF-8 in most - but not all - places.
- UTF-8 is fine on Windows. Libuv converts UTF-8 to and from UTF-16, which is what the kernel expects.
- UTF-8 is common but not universal on UNIX systems. Most file systems are character-set agnostic and treat names as opaque bytes; the encoding is set only by convention. OS X's HFS+ is the most common exception to the rule.
Considerations:
- Conversions should be zero-byte safe because most C APIs operate on zero-terminated strings.
- JS strings are conceptually always UTF-16 but V8 accepts ISO-8859-1, UTF-8 and UTF-16 as input.
- Conversion (to JS string) from ISO-8859-1 is lossless but conversion from UTF-8 and UTF-16 is not: invalid byte sequences are replaced with U+FFFD.
- Conversely, conversion to UTF-8 and UTF-16 is lossless but conversion to ISO-8859-1 is not: out-of-range characters wrap around (U+0100 becomes the byte 0x00, for example), which can be insecure; see the bullet point about C APIs.
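
The two lossy directions described above can be reproduced from JavaScript. This is a minimal sketch, not a proposed fix: it assumes a Node.js version that provides `Buffer.from` and the `'latin1'` encoding name (older releases spell it `'binary'`), and the variable names are only illustrative.

```javascript
'use strict';

// "caf" followed by 0xE9: valid ISO-8859-1 ("café"), but not valid UTF-8.
const raw = Buffer.from([0x63, 0x61, 0x66, 0xe9]);

// Bytes -> JS string -> bytes through ISO-8859-1 is lossless.
const viaLatin1 = Buffer.from(raw.toString('latin1'), 'latin1');

// Bytes -> JS string through UTF-8 replaces the invalid byte with U+FFFD,
// so encoding the string again does not give back the original bytes.
const decodedUtf8 = raw.toString('utf8');
const viaUtf8 = Buffer.from(decodedUtf8, 'utf8');

// JS string -> ISO-8859-1 keeps only the low byte of each code unit,
// so U+0100 turns into a zero byte in the middle of the name.
const wrapped = Buffer.from('a\u0100b', 'latin1');

console.log(raw.equals(viaLatin1));        // round trip preserved?
console.log(decodedUtf8 === 'caf\ufffd');  // invalid byte replaced?
console.log(raw.equals(viaUtf8));          // round trip preserved?
console.log(wrapped);                      // note the embedded 0x00
```

If `wrapped` were handed to a C API that expects a zero-terminated string, the name would be cut off after the first byte, which is why the wrap-around matters for security.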