Allow constructing a SubString{String} with codeunit indexing, even if the substring isn't valid UTF-8. #58048

In Julia, we can construct strings that contain non-UTF-8 data. Per the docstring for `String`:

> while they are interpreted as being UTF-8 encoded, they can be composed of any byte sequence. Use isvalid to validate that the underlying byte sequence is valid as UTF-8.
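As a small illustration of the docstring's point, the snippet below builds the same byte sequence used later in this issue and checks its validity. It only uses `ncodeunits` and `isvalid` from Base; the variable name `s` is just for this example.

```julia
# A stray continuation byte (0xa8) followed by the two-byte encoding of Ψ (0xce 0xa8).
s = "\xa8\xce\xa8"

ncodeunits(s)         # 3 bytes are stored as-is
isvalid(s)            # false: the leading 0xa8 is not valid UTF-8 on its own
isvalid("\xce\xa8")   # true: these two bytes alone encode Ψ
```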

Julia provides a series of functions for indexing a string by code units, such as `codeunit(str, i) -> UInt8` and `codeunits(str::AbstractString) -> Base.CodeUnits`.
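For example, on the byte sequence used in this issue, code unit indexing works regardless of whether the bytes form valid UTF-8 (a minimal sketch; `s` and `cu` are names chosen here):

```julia
s = "\xa8\xce\xa8"

first_byte = codeunit(s, 1)   # 0xa8, read directly by code unit index
cu = codeunits(s)             # a vector-like wrapper over the string's bytes

length(cu)                    # 3
collect(cu)                   # the raw bytes: 0xa8, 0xce, 0xa8
```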

However, we cannot use `SubString{String}` to build a view over non-UTF-8 data in a string when that view is addressed by code units.
This is surprising, since the underlying struct appears architected to support it:
```julia
julia> dump(view("\xa8\xce\xa8", 1:1))
SubString{String}
  string: String "\xa8Ψ"
  offset: Int64 0
  ncodeunits: Int64 1
```
but we cannot _construct it_ directly from those fields, since the default constructor has been replaced with an inner constructor that takes start and end _character indices_, both of which must be valid string indices:
```julia
struct SubString{T<:AbstractString} <: AbstractString
    string::T
    offset::Int
    ncodeunits::Int

    function SubString{T}(s::T, i::Int, j::Int) where T<:AbstractString
        i ≤ j || return new(s, 0, 0)
        @boundscheck begin
            checkbounds(s, i:j)
            @inbounds isvalid(s, i) || string_index_err(s, i)
            @inbounds isvalid(s, j) || string_index_err(s, j)
        end
        return new(s, i-1, nextind(s,j)-i)
    end
end
```
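To make the limitation concrete, the sketch below shows what the constructor above does with the example string: a range that starts in the middle of a character is rejected, and a range that starts on a character boundary is extended to cover the whole character. The names `s`, `rejected`, and `sub` are only for this illustration.

```julia
s = "\xa8\xce\xa8"

# Code unit 3 is the continuation byte of Ψ, so it is not a valid string index.
isvalid(s, 3)                      # false

# Asking for a substring at that index hits the isvalid check in the constructor.
rejected = try
    SubString(s, 3, 3)
    false
catch err
    err isa StringIndexError
end

# Starting on a character boundary works, but the view covers both bytes of Ψ,
# so a one-byte view of 0xce alone cannot be expressed this way.
sub = SubString(s, 2, 2)
ncodeunits(sub)                    # 2
```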

Can we provide an additional function that constructs a `SubString{String}` from `offset` and `ncodeunits`, so that a `SubString` is allowed to refer to a byte range that is not valid UTF-8?
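For comparison, the closest thing available today seems to be a view over the code units rather than over the string. The sketch below is not the requested feature: the result is a `SubArray` of bytes, not an `AbstractString`, so string functions do not apply to it without copying. The names `bytes` and `copied` are chosen for this example.

```julia
s = "\xa8\xce\xa8"

# A non-copying view of the single byte at code unit 2.
bytes = view(codeunits(s), 2:2)
bytes[1]                           # 0xce

# Getting a string back requires copying the bytes into a new String.
copied = String(collect(bytes))
isvalid(copied)                    # false: a lone lead byte is not valid UTF-8
```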
