Don't @inbounds AbstractArray's iterate method; optimize checkbounds instead #58793

Merged: 14 commits into JuliaLang:master, Jul 2, 2025

Conversation

@mbauman (Member) commented Jun 23, 2025

Split off from #58785, this simplifies iterate and removes the @inbounds call that was added in #58635. It achieves the same (or better!) performance, however, by targeting optimizations in checkbounds and — in particular — the construction of a linear eachindex (against which the bounds are checked).
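The linear-eachindex idea at the heart of this can be seen directly at the REPL. A small illustration (the exact codegen details are version-dependent):

```julia
# For an IndexLinear array such as Array, eachindex is a simple linear
# range, so checkbounds reduces to an integer comparison that LLVM can
# track and hoist out of loops.
A = rand(4, 4)
eachindex(A)             # Base.OneTo(16) for an Array
checkbounds(Bool, A, 16) # true: 16 is within eachindex(A)
checkbounds(Bool, A, 17) # false: 17 is past the end
```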

@mbauman mbauman requested review from giordano and jishnub June 23, 2025 18:56
@mbauman mbauman added the arrays and iteration (Involves iteration or the iteration protocol) labels Jun 23, 2025
@giordano (Member)

It achieves the same (or better!) performance

Do you have any benchmarks handy?

@mbauman (Member, Author) commented Jun 23, 2025

Using @jishnub's benchmark:

Nightly @ f61c640 (which has the @inbounds):

julia> using BenchmarkTools
Precompiling BenchmarkTools finished.
  8 dependencies successfully precompiled in 12 seconds. 9 already precompiled.

julia> using LinearAlgebra

julia> A = rand(1000,1000); v2 = view(A, 1:2:lastindex(A));

julia> @btime norm(Iterators.map(splat(-), zip($v2, $v2)));
  1.941 ms (0 allocations: 0 bytes)

vs. b09514e:

julia> @btime norm(Iterators.map(splat(-), zip($v2, $v2)));
  312.167 μs (0 allocations: 0 bytes)

Comment on lines +1241 to +1247:

    @inline iterate(A::AbstractArray, state = iterate_starting_state(A)) = _iterate(A, state)
    @inline function _iterate(A::AbstractArray, state::Tuple)
        y = iterate(state...)
        y === nothing && return nothing
        A[y[1]], (state[1], tail(y)...)
    end
   -function _iterate(A::AbstractArray, state::Integer)
   +@inline function _iterate(A::AbstractArray, state::Integer)
Review comment (Member):

are these functions really so big that they aren't being inlined automatically?

@mbauman (Member, Author) replied Jun 23, 2025:

Before b09514e, yes, they were (well, kinda. length itself was not inlining; had I added @inline to it, though, it'd make this method not inline). That branch was enough to push things over the inlining limit. As I wrote in #58785 (comment),

this calls abstract infrastructure that could be large (and itself @inline'd)... which would then require these generic methods to be similarly @inline.
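For context, one way to check whether a small method actually inlines into its caller (illustrative; inlining cost thresholds vary across Julia versions):

```julia
# If f inlines into g, code_typed(g, ...) shows the raw add/mul
# instructions directly, rather than an invoke(f, ...) call.
f(x) = x + 1
g(x) = 2 * f(x)
code_typed(g, (Int,))
```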

@mbauman (Member, Author) commented Jun 23, 2025

@nanosoldier runbenchmarks()

@giordano (Member)

Test failures (quite a lot) look relevant 😢

@mbauman mbauman changed the title Don't @inbounds in AbstractArray's iterate method Don't @inbounds AbstractArray's iterate method; optimize checkbounds instead Jun 23, 2025
mbauman added 7 commits June 23, 2025 16:54. Two of the commit messages:
  • This is not valid because it only checks the resulting index into the parent (which we explicitly say we do not check) but skips the checks into the indices (which are the important ones!)
  • theoretically this is not guaranteed -- indeed an unreachable branch is not currently present for CodeUnits
@mbauman (Member, Author) commented Jun 25, 2025

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@JeffBezanson (Member)

Very good PR. Benchmarks are an interesting mixed bag of slowdowns and infinite speedups :) Not sure what to do about it.

@mbauman (Member, Author) commented Jun 26, 2025

The regressions that look meaningful to me are in the iteration of Dict, IdDict and BitSet — those are reporting to be on the order of 1.8x slower. It may be possible to address those; my hunch is that they're hitting the range-length change. That branch is surely constant-folded or statically known in most cases. Maybe adding @inline to that method is the better solution.

@mbauman (Member, Author) commented Jun 27, 2025

OK, I don't think any of the regressions are actually real. I can't reproduce the iteration ones locally, and BenchmarkTools is doing something wildly wrong for the in-place BitSet ones. Just looking at nightly:

julia> using Random, StableRNGs, BenchmarkTools

julia> const RNG = StableRNG(1)
StableRNGs.LehmerRNG(state=0x00000000000000000000000000000003)

julia> const iterlen = 1000;

julia> const ints = rand(RNG, 1:iterlen, iterlen);

julia> c = BitSet(ints);

julia> const newints = [rand(RNG, ints, 10); rand(RNG, 1:iterlen, 10); rand(RNG, iterlen:2iterlen, 10);];

julia> c2 = BitSet(newints);

julia> @benchmark union!(x, $c2) setup=(x=copy($c)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):   0.001 ns … 916.000 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     42.000 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   46.112 ns ±  33.456 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄          ▇█          ▂                       ▁▃            ▁
  █▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▇ █
  0.001 ns      Histogram: log(frequency) by time       208 ns <

 Memory estimate: 736 bytes, allocs estimate: 1.

That is... weird.

@nanosoldier runbenchmarks(!("scalar" || "dates" || "io" || "problem" || "inference"), vs = "master")

@mbauman (Member, Author) commented Jun 27, 2025

@nanosoldier runbenchmarks(["array","collection","find","misc","sort","sparse","tuple","union"], vs=":master")

@mbauman (Member, Author) commented Jun 27, 2025

🤷

@nanosoldier runbenchmarks(ALL, vs=":master")

@vtjnash (Member) commented Jun 27, 2025

I don't think nanosoldier knows how to distribute ! over ||, and [ means a single test that matches all of those tags in order

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@mbauman (Member, Author) commented Jun 30, 2025

That looks much better. I think there are only two (non-scalar) regressions over 1.1x that appear on both pages:

ID                                                         time ratio (1)  memory ratio (1)  time ratio (2)  memory ratio (2)
["array", "setindex!", ("setindex!", 3)]                   1.29 (5%) ❌    1.00 (1%)         1.17 (5%) ❌    1.00 (1%)
["tuple", "linear algebra", ("matvec", "(4, 4)", "(4,)")]  1.52 (5%) ❌    1.00 (1%)         1.13 (5%) ❌    1.00 (1%)

The perf improvements are stable across the two runs. This is looking good to me.

@mbauman (Member, Author) commented Jun 30, 2025

Just for posterity, as I understand it, there are three separate considerations — each of which helps increase the chance that the compiler does something smart here:

  • Ensuring length inlines (here by making a branch do math instead)
  • The iteration implementation here takes the form checkbounds(Bool, x, i) ? (x[i], i+1) : nothing... and x[i] itself should have the exact same checkbounds(Bool, x, i) predicate inside it. Making sure those two predicates match exactly increases the odds that the compiler skips the inner branch entirely. (That's what's behind the CodeUnits change)
  • Making sure the checkbounds test itself is linear (purely adds and muls) helps LLVM track repeated checkbounds inside a for loop and hoist it out entirely. (That's what moving to unchecked_oneto does)
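The first two points can be sketched with a toy version of the pattern (illustrative, not the exact Base implementation):

```julia
# Toy iterate in the checkbounds-predicated form described above. The
# checkbounds(Bool, x, i) test here is the very same predicate that
# getindex performs internally, so the compiler can merge the two
# branches and skip the inner bounds check entirely.
function myiterate(x::AbstractArray, i::Int = firstindex(x))
    checkbounds(Bool, x, i) ? (x[i], i + 1) : nothing
end

v = [10, 20, 30]
myiterate(v)     # (10, 2)
myiterate(v, 3)  # (30, 4)
myiterate(v, 4)  # nothing
```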

@giordano (Member) left a comment

I'm not deeply familiar with this part of the code, but the changes look sensible: the diff in base (excluding tests) is net negative, we get new tests, and performance is more consistently better. Looks like a win-win situation.

@mbauman (Member, Author) commented Jul 1, 2025

OK, given that the length changes here are relatively ancillary to the main objective around iteration, I've split those off into #58864, and instead I just mark it (and its callers) @inline for now. Given the fundamental importance of the length of a range, I figure that's prudent. Specifically targeting the inlining of this method (and its callers) gets us some of the advantages in the original benchmark. That benchmark now lands about half-way between the two timings I reported in #58793 (comment):

julia> using BenchmarkTools, LinearAlgebra

julia> A = rand(1000,1000); v2 = view(A, 1:2:lastindex(A));

julia> @btime norm(Iterators.map(splat(-), zip($v2, $v2)));
  931.917 μs (0 allocations: 0 bytes)

@mbauman mbauman merged commit e631972 into JuliaLang:master Jul 2, 2025
8 checks passed
@mbauman mbauman deleted the mb+mg/array-iteration branch July 2, 2025 13:03
Labels: arrays, iteration (Involves iteration or the iteration protocol), performance (Must go faster)