Commit b57b836

[stdlib] Update UTF8Span documentation (#83418)
Amend formatting of `Substring.utf8Span` example code. Use DocC tables in `Unicode.UTF8.ValidationError` overview.

Co-authored-by: Alex Martini <[email protected]>
1 parent d63bbb9 commit b57b836

7 files changed: +204 -117 lines changed
stdlib/public/core/UTF8EncodingError.swift

Lines changed: 30 additions & 26 deletions
@@ -1,25 +1,33 @@
+//===----------------------------------------------------------------------===//
+//
+// This source file is part of the Swift.org open source project
+//
+// Copyright (c) 2025 Apple Inc. and the Swift project authors
+// Licensed under Apache License v2.0 with Runtime Library Exception
+//
+// See https://swift.org/LICENSE.txt for license information
+// See https://swift.org/CONTRIBUTORS.txt for the list of Swift project authors
+//
+//===----------------------------------------------------------------------===//
+
 extension Unicode.UTF8 {
 /**
 
 The kind and location of a UTF-8 encoding error.
 
 Valid UTF-8 is represented by this table:
 
-```
-╔════════════════════╦════════╦════════╦════════╦════════╗
-║ Scalar value ║ Byte 0 ║ Byte 1 ║ Byte 2 ║ Byte 3 ║
-╠════════════════════╬════════╬════════╬════════╬════════╣
-║ U+0000..U+007F ║ 00..7F ║ ║ ║ ║
-║ U+0080..U+07FF ║ C2..DF ║ 80..BF ║ ║ ║
-║ U+0800..U+0FFF ║ E0 ║ A0..BF ║ 80..BF ║ ║
-║ U+1000..U+CFFF ║ E1..EC ║ 80..BF ║ 80..BF ║ ║
-║ U+D000..U+D7FF ║ ED ║ 80..9F ║ 80..BF ║ ║
-║ U+E000..U+FFFF ║ EE..EF ║ 80..BF ║ 80..BF ║ ║
-║ U+10000..U+3FFFF ║ F0 ║ 90..BF ║ 80..BF ║ 80..BF ║
-║ U+40000..U+FFFFF ║ F1..F3 ║ 80..BF ║ 80..BF ║ 80..BF ║
-║ U+100000..U+10FFFF ║ F4 ║ 80..8F ║ 80..BF ║ 80..BF ║
-╚════════════════════╩════════╩════════╩════════╩════════╝
-```
+| Scalar value | Byte 0 | Byte 1 | Byte 2 | Byte 3 |
+| ------------------ | ------ | ------ | ------ | ------ |
+| U+0000..U+007F | 00..7F | | | |
+| U+0080..U+07FF | C2..DF | 80..BF | | |
+| U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
+| U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
+| U+D000..U+D7FF | ED | 80..9F | 80..BF | |
+| U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
+| U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
+| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
+| U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
 
 ### Classifying errors
 
@@ -49,8 +57,8 @@ extension Unicode.UTF8 {
 encodings are invalid UTF-8 and can lead to security issues if not
 correctly detected:
 
-- https://nvd.nist.gov/vuln/detail/CVE-2008-2938
-- https://nvd.nist.gov/vuln/detail/CVE-2000-0884
+- <https://nvd.nist.gov/vuln/detail/CVE-2008-2938>
+- <https://nvd.nist.gov/vuln/detail/CVE-2000-0884>
 
 An overlong encoding of `NUL`, `0xC0 0x80`, is used in Java's Modified
 UTF-8 but is invalid UTF-8. Overlong encoding errors often catch attempts
@@ -85,15 +93,11 @@ extension Unicode.UTF8 {
 the reported range. Similarly, constructing a single error for the longest
 invalid byte range can be constructed by joining adjacent error ranges.
 
-```
-╔═════════════════╦══════╦═════╦═════╦═════╦═════╦═════╦═════╦══════╗
-║ ║ 61 ║ F1 ║ 80 ║ 80 ║ E1 ║ 80 ║ C2 ║ 62 ║
-╠═════════════════╬══════╬═════╬═════╬═════╬═════╬═════╬═════╬══════╣
-║ Longest range ║ U+61 ║ err ║ ║ ║ ║ ║ ║ U+62 ║
-║ Maximal subpart ║ U+61 ║ err ║ ║ ║ err ║ ║ err ║ U+62 ║
-║ Error per byte ║ U+61 ║ err ║ err ║ err ║ err ║ err ║ err ║ U+62 ║
-╚═════════════════╩══════╩═════╩═════╩═════╩═════╩═════╩═════╩══════╝
-```
+| Algorithm | 61 | F1 | 80 | 80 | E1 | 80 | C2 | 62 |
+| --------------- | ---- | --- | --- | --- | --- | --- | --- | ---- |
+| Longest range | U+61 | err | | | | | | U+62 |
+| Maximal subpart | U+61 | err | | | err | | err | U+62 |
+| Error per byte | U+61 | err | err | err | err | err | err | U+62 |
 
 */
 @available(SwiftStdlib 6.2, *)
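
Reviewer note: the well-formedness table in the `Unicode.UTF8.ValidationError` overview can be read as a small state machine keyed on the lead byte. The following is only a sketch of that reading, not the standard library's validator; `sequenceShape(forLeadByte:)` and `isValidUTF8(_:)` are hypothetical names introduced here for illustration.

```swift
// Hypothetical helper: for a non-ASCII lead byte, returns the allowed range
// of byte 1 and the total sequence length, following the table row by row.
func sequenceShape(forLeadByte b0: UInt8) -> (second: ClosedRange<UInt8>, length: Int)? {
  switch b0 {
  case 0xC2...0xDF: return (0x80...0xBF, 2)
  case 0xE0:        return (0xA0...0xBF, 3)  // excludes overlong 3-byte forms
  case 0xE1...0xEC: return (0x80...0xBF, 3)
  case 0xED:        return (0x80...0x9F, 3)  // excludes surrogates U+D800..U+DFFF
  case 0xEE...0xEF: return (0x80...0xBF, 3)
  case 0xF0:        return (0x90...0xBF, 4)  // excludes overlong 4-byte forms
  case 0xF1...0xF3: return (0x80...0xBF, 4)
  case 0xF4:        return (0x80...0x8F, 4)  // excludes values above U+10FFFF
  default:          return nil               // 80..C1 and F5..FF never start a scalar
  }
}

// Hypothetical helper: true if `bytes` is well-formed UTF-8 per the table.
func isValidUTF8(_ bytes: [UInt8]) -> Bool {
  var i = 0
  while i < bytes.count {
    let b0 = bytes[i]
    if b0 <= 0x7F { i += 1; continue }       // ASCII row: single byte
    guard let shape = sequenceShape(forLeadByte: b0),
          i + shape.length <= bytes.count,   // sequence must not be truncated
          shape.second.contains(bytes[i + 1])
    else { return false }
    // Bytes 2 and 3, when present, are always plain continuation bytes.
    for j in (i + 2)..<(i + shape.length) {
      if !(0x80...0xBF).contains(bytes[j]) { return false }
    }
    i += shape.length
  }
  return true
}
```

Under this sketch, the overlong `0xC0 0x80` mentioned in the overview is rejected because `C0` matches no row of the table, and the sample bytes `61 F1 80 80 E1 80 C2 62` are rejected at the truncated `F1 80 80` sequence.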

stdlib/public/core/UTF8Span.swift

Lines changed: 116 additions & 84 deletions
@@ -1,7 +1,17 @@
-// TODO: comment header
-
-
-/// A borrowed view into contiguous memory that contains validly-encoded UTF-8 code units.
+//===----------------------------------------------------------------------===//
+//
+// This source file is part of the Swift.org open source project
+//
+// Copyright (c) 2025 Apple Inc. and the Swift project authors
+// Licensed under Apache License v2.0 with Runtime Library Exception
+//
+// See https://swift.org/LICENSE.txt for license information
+// See https://swift.org/CONTRIBUTORS.txt for the list of Swift project authors
+//
+//===----------------------------------------------------------------------===//
+
+/// A borrowed view into contiguous memory that contains validly-encoded UTF-8
+/// code units.
 @frozen
 @safe
 @available(SwiftStdlib 6.2, *)
@@ -13,12 +23,12 @@ public struct UTF8Span: Copyable, ~Escapable, BitwiseCopyable {
 A bit-packed count and flags (such as isASCII)
 
 ╔═══════╦═════╦══════════╦═══════╗
-║ b63 ║ b62 ║ b61:56 ║ b56:0 ║
+║ b63 ║ b62 ║ b61:56 ║ b55:0 ║
 ╠═══════╬═════╬══════════╬═══════╣
 ║ ASCII ║ NFC ║ reserved ║ count ║
 ╚═══════╩═════╩══════════╩═══════╝
 
-ASCII means the contents are known to be all-ASCII (<0x7F).
+ASCII means the contents are known to be all-ASCII (<=0x7F).
 NFC means contents are known to be in normal form C for fast comparisons.
 */
 @usableFromInline
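
Reviewer note: the corrected layout (flags in bits 63 and 62, reserved bits 61:56, count in bits 55:0) can be illustrated with a small stand-alone type. This is a sketch of the bit arithmetic only; `PackedCountAndFlags` and its members are hypothetical names and do not reflect the stored property's actual declaration.

```swift
// Hypothetical illustration of the documented 64-bit layout:
//   b63 = ASCII flag, b62 = NFC flag, b61:56 = reserved, b55:0 = count.
struct PackedCountAndFlags {
  static let asciiBit: UInt64 = 1 << 63
  static let nfcBit: UInt64 = 1 << 62
  static let countMask: UInt64 = (1 << 56) - 1   // low 56 bits

  var rawValue: UInt64

  init(count: Int, isASCII: Bool, isNFC: Bool) {
    // The count must fit in 56 bits so it cannot spill into reserved bits.
    precondition(count >= 0 && UInt64(count) <= Self.countMask)
    var bits = UInt64(count)
    if isASCII { bits |= Self.asciiBit }
    if isNFC { bits |= Self.nfcBit }
    rawValue = bits
  }

  // Assumes a 64-bit `Int`; a 56-bit count always fits.
  var count: Int { Int(rawValue & Self.countMask) }
  var isKnownASCII: Bool { rawValue & Self.asciiBit != 0 }
  var isKnownNFC: Bool { rawValue & Self.nfcBit != 0 }
}
```

Writing the count field as `b55:0` matters because bit 56 already belongs to the reserved field; the old `b56:0` label overlapped it.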
@@ -200,7 +210,8 @@ extension UTF8Span {
 extension String {
 /// Creates a new string, copying the specified code units.
 ///
-/// This initializer skips UTF-8 validation because `codeUnits` must contain valid UTF-8.
+/// This initializer skips UTF-8 validation because `codeUnits` must contain
+/// valid UTF-8.
 ///
 /// - Complexity: O(n)
 @available(SwiftStdlib 6.2, *)
@@ -241,17 +252,17 @@ extension String {
 }
 
 #if !(os(watchOS) && _pointerBitWidth(_32))
-/// A UTF8span over the code units that make up this string.
+/// A UTF-8 span over the code units that make up this string.
 ///
-/// - Note: In the case of bridged UTF16 String instances (on Apple
-/// platforms,) this property transcodes the code units the first time
-/// it is called. The transcoded buffer is cached, and subsequent calls
-/// to `span` can reuse the buffer.
+/// - Note: In the case of bridged UTF-16 string instances (on Apple
+/// platforms) this property transcodes the code units the first time
+/// it's called. The transcoded buffer is cached, and subsequent calls
+/// can reuse the buffer.
 ///
-/// Returns: a `UTF8Span` over the code units of this String.
+/// - Returns: A `UTF8Span` over the code units of this string.
 ///
-/// Complexity: O(1) for native UTF8 Strings,
-/// amortized O(1) for bridged UTF16 Strings.
+/// - Complexity: O(1) for native UTF-8 strings, amortized O(1) for bridged
+/// UTF-16 strings.
 @available(SwiftStdlib 6.2, *)
 public var utf8Span: UTF8Span {
 @lifetime(borrow self)
@@ -262,17 +273,17 @@ extension String {
 }
 }
 
-/// A UTF8span over the code units that make up this string.
+/// A UTF-8 span over the code units that make up this string.
 ///
-/// - Note: In the case of bridged UTF16 String instances (on Apple
-/// platforms,) this property transcodes the code units the first time
-/// it is called. The transcoded buffer is cached, and subsequent calls
-/// to `span` can reuse the buffer.
+/// - Note: In the case of bridged UTF-16 string instances (on Apple
+/// platforms) this property transcodes the code units the first time
+/// it's called. The transcoded buffer is cached, and subsequent calls
+/// can reuse the buffer.
 ///
-/// Returns: a `UTF8Span` over the code units of this String.
+/// - Returns: A `UTF8Span` over the code units of this string.
 ///
-/// Complexity: O(1) for native UTF8 Strings,
-/// amortized O(1) for bridged UTF16 Strings.
+/// - Complexity: O(1) for native UTF-8 strings, amortized O(1) for bridged
+/// UTF-16 strings.
 @available(SwiftStdlib 6.2, *)
 public var _utf8Span: UTF8Span? {
 @_alwaysEmitIntoClient @inline(__always)
@@ -287,18 +298,18 @@ extension String {
 fatalError("\(#function) unavailable on 32-bit watchOS")
 }
 
-/// A UTF8span over the code units that make up this string.
+/// A UTF-8 span over the code units that make up this string.
 ///
-/// - Note: In the case of bridged UTF16 String instances (on Apple
-/// platforms,) this property transcodes the code units the first time
-/// it is called. The transcoded buffer is cached, and subsequent calls
-/// to `span` can reuse the buffer.
+/// - Note: In the case of bridged UTF-16 string instances (on Apple
+/// platforms) this property transcodes the code units the first time
+/// it's called. The transcoded buffer is cached, and subsequent calls
+/// can reuse the buffer.
 ///
-/// Returns: a `UTF8Span` over the code units of this String, or `nil`
-/// if the String does not have a contiguous representation.
+/// - Returns: A `UTF8Span` over the code units of this string, or `nil`
+/// if the string does not have a contiguous representation.
 ///
-/// Complexity: O(1) for native UTF8 Strings,
-/// amortized O(1) for bridged UTF16 Strings.
+/// - Complexity: O(1) for native UTF-8 strings, amortized O(1) for bridged
+/// UTF-16 strings.
 @available(SwiftStdlib 6.2, *)
 public var _utf8Span: UTF8Span? {
 @lifetime(borrow self)
@@ -346,27 +357,34 @@ extension Substring {
 }
 
 #if !(os(watchOS) && _pointerBitWidth(_32))
-/// A UTF8Span over the code units that make up this substring.
+/// A UTF-8 span over the code units that make up this substring.
+///
+/// - Note: In the case of bridged UTF-16 string instances (on Apple
+/// platforms) this property needs to transcode the code units every time
+/// it's called.
+///
+/// For example, if `string` has the bridged UTF-16 representation,
+/// the following code is accidentally quadratic because of this issue:
 ///
-/// - Note: In the case of bridged UTF16 String instances (on Apple
-/// platforms,) this property needs to transcode the code units every time
-/// it is called.
-/// For example, if `string` has the bridged UTF16 representation,
 /// for word in string.split(separator: " ") {
 /// useSpan(word.span)
 /// }
-/// is accidentally quadratic because of this issue. A workaround is to
-/// explicitly convert the string into its native UTF8 representation:
-/// var nativeString = consume string
-/// nativeString.makeContiguousUTF8()
-/// for word in nativeString.split(separator: " ") {
-/// useSpan(word.span)
-/// }
-/// This second option has linear time complexity, as expected.
-///
-/// Returns: a `UTF8Span` over the code units of this Substring.
-///
-/// Complexity: O(1) for native UTF8 Strings, O(n) for bridged UTF16 Strings.
+///
+/// A workaround is to explicitly convert the string into its native UTF-8
+/// representation:
+///
+/// var nativeString = consume string
+/// nativeString.makeContiguousUTF8()
+/// for word in nativeString.split(separator: " ") {
+/// useSpan(word.span)
+/// }
+///
+/// This second option has linear time complexity, as expected.
+///
+/// - Returns: A `UTF8Span` over the code units of this substring.
+///
+/// - Complexity: O(1) for native UTF-8 strings, O(n) for bridged UTF-16
+/// strings.
 @available(SwiftStdlib 6.2, *)
 public var utf8Span: UTF8Span {
 @lifetime(borrow self)
@@ -377,27 +395,34 @@ extension Substring {
 }
 }
 
-/// A UTF8Span over the code units that make up this substring.
+/// A UTF-8 span over the code units that make up this substring.
+///
+/// - Note: In the case of bridged UTF-16 string instances (on Apple
+/// platforms) this property needs to transcode the code units every time
+/// it's called.
+///
+/// For example, if `string` has the bridged UTF-16 representation,
+/// the following code is accidentally quadratic because of this issue:
 ///
-/// - Note: In the case of bridged UTF16 String instances (on Apple
-/// platforms,) this property needs to transcode the code units every time
-/// it is called.
-/// For example, if `string` has the bridged UTF16 representation,
 /// for word in string.split(separator: " ") {
 /// useSpan(word.span)
 /// }
-/// is accidentally quadratic because of this issue. A workaround is to
-/// explicitly convert the string into its native UTF8 representation:
-/// var nativeString = consume string
-/// nativeString.makeContiguousUTF8()
-/// for word in nativeString.split(separator: " ") {
-/// useSpan(word.span)
-/// }
-/// This second option has linear time complexity, as expected.
-///
-/// Returns: a `UTF8Span` over the code units of this Substring.
-///
-/// Complexity: O(1) for native UTF8 Strings, O(n) for bridged UTF16 Strings.
+///
+/// A workaround is to explicitly convert the string into its native UTF-8
+/// representation:
+///
+/// var nativeString = consume string
+/// nativeString.makeContiguousUTF8()
+/// for word in nativeString.split(separator: " ") {
+/// useSpan(word.span)
+/// }
+///
+/// This second option has linear time complexity, as expected.
+///
+/// - Returns: A `UTF8Span` over the code units of this substring.
+///
+/// - Complexity: O(1) for native UTF-8 strings, O(n) for bridged UTF-16
+/// strings.
 @available(SwiftStdlib 6.2, *)
 public var _utf8Span: UTF8Span? {
 @_alwaysEmitIntoClient @inline(__always)
@@ -412,28 +437,35 @@ extension Substring {
 fatalError("\(#function) unavailable on 32-bit watchOS")
 }
 
-/// A UTF8Span over the code units that make up this substring.
+/// A UTF-8 span over the code units that make up this substring.
+///
+/// - Note: In the case of bridged UTF-16 string instances (on Apple
+/// platforms) this property needs to transcode the code units every time
+/// it's called.
+///
+/// For example, if `string` has the bridged UTF-16 representation,
+/// the following code is accidentally quadratic because of this issue:
 ///
-/// - Note: In the case of bridged UTF16 String instances (on Apple
-/// platforms,) this property needs to transcode the code units every time
-/// it is called.
-/// For example, if `string` has the bridged UTF16 representation,
 /// for word in string.split(separator: " ") {
 /// useSpan(word.span)
 /// }
-/// is accidentally quadratic because of this issue. A workaround is to
-/// explicitly convert the string into its native UTF8 representation:
-/// var nativeString = consume string
-/// nativeString.makeContiguousUTF8()
-/// for word in nativeString.split(separator: " ") {
-/// useSpan(word.span)
-/// }
-/// This second option has linear time complexity, as expected.
-///
-/// Returns: a `UTF8Span` over the code units of this Substring, or `nil`
-/// if the Substring does not have a contiguous representation.
-///
-/// Complexity: O(1) for native UTF8 Strings, O(n) for bridged UTF16 Strings.
+///
+/// A workaround is to explicitly convert the string into its native UTF-8
+/// representation:
+///
+/// var nativeString = consume string
+/// nativeString.makeContiguousUTF8()
+/// for word in nativeString.split(separator: " ") {
+/// useSpan(word.span)
+/// }
+///
+/// This second option has linear time complexity, as expected.
+///
+/// - Returns: A `UTF8Span` over the code units of this substring, or `nil`
+/// if the substring does not have a contiguous representation.
+///
+/// - Complexity: O(1) for native UTF-8 strings, O(n) for bridged UTF-16
+/// strings.
 @available(SwiftStdlib 6.2, *)
 public var _utf8Span: UTF8Span? {
 @lifetime(borrow self)
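
Reviewer note: the workaround described in the `Substring` doc comments (convert to native UTF-8 once, then split) can be sketched with APIs that predate `UTF8Span`, so it compiles on older toolchains. `totalWordBytes(in:)` is a hypothetical function, and `withContiguousStorageIfAvailable` is used here only as a stand-in for the `useSpan(...)` call in the doc comment.

```swift
// Hypothetical example: sums the UTF-8 byte counts of all space-separated
// words, paying any UTF-16 to UTF-8 conversion cost once up front.
func totalWordBytes(in string: String) -> Int {
  var native = string
  // Converts a bridged string to native contiguous UTF-8 storage.
  // This does nothing if the string is already native UTF-8.
  native.makeContiguousUTF8()

  var total = 0
  for word in native.split(separator: " ") {
    // Each substring now shares native storage, so borrowing its bytes
    // does not need to transcode again.
    total += word.utf8.withContiguousStorageIfAvailable { $0.count }
      ?? word.utf8.count
  }
  return total
}
```

Calling `makeContiguousUTF8()` before the loop is what turns the per-word transcoding cost into a single linear pass.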

stdlib/public/core/UTF8SpanBits.swift

Lines changed: 12 additions & 0 deletions
@@ -1,3 +1,15 @@
+//===----------------------------------------------------------------------===//
+//
+// This source file is part of the Swift.org open source project
+//
+// Copyright (c) 2025 Apple Inc. and the Swift project authors
+// Licensed under Apache License v2.0 with Runtime Library Exception
+//
+// See https://swift.org/LICENSE.txt for license information
+// See https://swift.org/CONTRIBUTORS.txt for the list of Swift project authors
+//
+//===----------------------------------------------------------------------===//
+
 @available(SwiftStdlib 6.2, *)
 extension UTF8Span {
 /// Returns whether contents are known to be all-ASCII. A return value of

stdlib/public/core/UTF8SpanComparisons.swift

Lines changed: 11 additions & 2 deletions
@@ -1,5 +1,14 @@
-// TODO: comment header
-
+//===----------------------------------------------------------------------===//
+//
+// This source file is part of the Swift.org open source project
+//
+// Copyright (c) 2025 Apple Inc. and the Swift project authors
+// Licensed under Apache License v2.0 with Runtime Library Exception
+//
+// See https://swift.org/LICENSE.txt for license information
+// See https://swift.org/CONTRIBUTORS.txt for the list of Swift project authors
+//
+//===----------------------------------------------------------------------===//
 
 @available(SwiftStdlib 6.2, *)
 extension UTF8Span {
