Description
In Julia, we are allowed to construct strings with non-UTF-8 data. Per the docstring for String:
while they are interpreted as being UTF-8 encoded, they can be composed of any byte sequence. Use isvalid to validate that the underlying byte sequence is valid as UTF-8.
Julia provides a series of functions for indexing a string by code units, such as codeunit(str, i) -> UInt8 and codeunits(str::Str) -> Base.CodeUnits.
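For illustration, here is a small sketch of those code-unit functions applied to the same non-UTF-8 string used later in this issue. The variable name s is only for the example.

```julia
# A string whose first byte (0xa8) is a stray continuation byte,
# followed by the valid two-byte encoding of Ψ (0xce 0xa8).
s = "\xa8\xce\xa8"

isvalid(s)        # false: the byte sequence is not valid UTF-8
ncodeunits(s)     # 3
codeunit(s, 2)    # 0xce, the second byte, regardless of character boundaries
codeunits(s)      # a Base.CodeUnits wrapper, an AbstractVector{UInt8} over s
```

Every byte is reachable this way, including bytes that do not start a character.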
However, we cannot use SubString{String} to build a view over an arbitrary code unit range of a string, which is what indexing non-UTF-8 data by code units requires.
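A minimal sketch of the limitation, using the same string as the dump example in this issue. It relies only on the public SubString(s, i, j) constructor; the names s, sub, and err are illustrative.

```julia
s = "\xa8\xce\xa8"   # bytes: 0xa8 | 0xce 0xa8 (the last two encode Ψ)

# Index 2 starts the character Ψ, so this succeeds, but the view is
# widened to the whole character: it covers two code units, not one.
sub = SubString(s, 2, 2)
ncodeunits(sub)      # 2

# Index 3 is a continuation byte inside Ψ, so it is not a valid
# character index and the constructor rejects it.
err = try
    SubString(s, 3, 3)
    nothing
catch e
    e
end
err isa StringIndexError   # true
```

So there is no way through this constructor to get a view of exactly byte 2 or exactly byte 3.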
This is surprising, since the underlying struct appears architected to support it:
julia> dump(view("\xa8\xce\xa8", 1:1))
SubString{String}
  string: String "\xa8Ψ"
  offset: Int64 0
  ncodeunits: Int64 1
but we cannot construct one directly from an offset and a code unit count, since the default constructor has been replaced with one taking start and end indices, both of which must be valid character indices.
struct SubString{T<:AbstractString} <: AbstractString
    string::T
    offset::Int
    ncodeunits::Int

    function SubString{T}(s::T, i::Int, j::Int) where T<:AbstractString
        i ≤ j || return new(s, 0, 0)
        @boundscheck begin
            checkbounds(s, i:j)
            @inbounds isvalid(s, i) || string_index_err(s, i)
            @inbounds isvalid(s, j) || string_index_err(s, j)
        end
        return new(s, i-1, nextind(s,j)-i)
    end
end
Can we provide an additional function to allow constructing a SubString{String} via offset and ncodeunits, allowing a SubString to not refer to a valid UTF-8 string?
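For comparison, the closest thing available today seems to be a view over the code units rather than over the string. The sketch below is only a workaround under that assumption, not the requested feature: the result is an AbstractVector{UInt8}, not an AbstractString, and converting it back to a String copies the bytes.

```julia
s = "\xa8\xce\xa8"

# A non-copying view of exactly the third byte, which SubString cannot express.
bytes = view(codeunits(s), 3:3)
length(bytes)     # 1
bytes[1]          # 0xa8

# Getting string behavior back requires materializing a new String (a copy).
t = String(bytes)
ncodeunits(t)     # 1
```

A constructor taking offset and ncodeunits would avoid that copy while keeping the AbstractString interface.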