
Allow constructing a SubString{String} with codeunit indexing, even if substring isn't valid UTF-8. #58048

@NHDaly


In Julia, we are allowed to construct strings that contain non-UTF-8 data. Per the docstring for String:

while they are interpreted as being UTF-8 encoded, they can be composed of any byte sequence. Use isvalid to validate that the underlying byte sequence is valid as UTF-8.

Julia provides a series of functions for indexing a string via codeunits, such as codeunit(str, i) -> UInt8 and codeunits(str::Str) -> Base.CodeUnits.
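
For illustration, here is a minimal sketch of those code-unit functions applied to the same invalid string used later in this issue. The variable name `s` is only for the example; every call is from Base.

```julia
# Three raw bytes: a stray continuation byte (0xa8), then the two-byte
# UTF-8 encoding of Ψ (0xce 0xa8).
s = "\xa8\xce\xa8"

isvalid(s)             # false: the string as a whole is not valid UTF-8
ncodeunits(s)          # 3 code units (bytes)
length(s)              # 2 characters: the invalid byte, then Ψ
codeunit(s, 1)         # 0xa8, indexed by code unit rather than by character
collect(codeunits(s))  # UInt8[0xa8, 0xce, 0xa8]
```

None of these calls require the indexed position to be a valid character boundary.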

However, we cannot use SubString{String} to build a view over a string that indexes non-UTF-8 data by code units.
This is surprising, since the underlying struct appears architected to support it:

julia> dump(view("\xa8\xce\xa8", 1:1))
SubString{String}
  string: String "\xa8Ψ"
  offset: Int64 0
  ncodeunits: Int64 1

but we cannot construct it, since the default constructor has been replaced with one taking a start and end character offset.

struct SubString{T<:AbstractString} <: AbstractString
    string::T
    offset::Int
    ncodeunits::Int

    function SubString{T}(s::T, i::Int, j::Int) where T<:AbstractString
        i ≤ j || return new(s, 0, 0)
        @boundscheck begin
            checkbounds(s, i:j)
            @inbounds isvalid(s, i) || string_index_err(s, i)
            @inbounds isvalid(s, j) || string_index_err(s, j)
        end
        return new(s, i-1, nextind(s,j)-i)
    end
end
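
To make the consequence of this constructor concrete, the sketch below applies it to the example string. It reads the `offset` and `ncodeunits` fields directly, which are internal fields shown by `dump` above rather than public API; the names `s`, `sub`, and `err` are only for the example.

```julia
s = "\xa8\xce\xa8"

# Index 2 is the first byte of Ψ, so it is a valid character index.
# The constructor stores offset = i - 1 and ncodeunits = nextind(s, j) - i,
# so the view always extends to the end of the character at j.
sub = SubString(s, 2, 2)
sub.offset       # 1
sub.ncodeunits   # 2, both bytes of Ψ; a 1-byte view starting here is not expressible

# Index 3 is a continuation byte, so the bounds check rejects it.
err = try
    SubString(s, 3, 3)
    nothing
catch e
    e
end
err isa StringIndexError   # true
```

In other words, the end of the view is always derived from a character boundary, never from a raw code unit count.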

Can we provide an additional function to allow constructing a SubString{String} via offset and ncodeunits, allowing a SubString to not refer to a valid UTF-8 string?
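
For comparison, here is a sketch of what is possible today without such a function: taking a byte view over `codeunits(s)`. This is only an approximation of the request, since the result is an `AbstractVector{UInt8}` rather than an `AbstractString`, and getting a string back requires a copy. The names `s`, `bytes`, and `copied` are only for the example.

```julia
s = "\xa8\xce\xa8"

# A non-copying view over a single code unit. No character-boundary
# checks are involved because this is ordinary array indexing.
bytes = view(codeunits(s), 2:2)
collect(bytes)       # UInt8[0xce]

# Converting back to a string copies the bytes, which is the cost the
# requested constructor would avoid.
copied = String(collect(bytes))
ncodeunits(copied)   # 1
isvalid(copied)      # false: a lone lead byte is not valid UTF-8
```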

Labels: design (Design of APIs or of the language itself), strings ("Strings!")