Don't @inbounds AbstractArray's iterate method; optimize checkbounds instead #58793

Merged: 14 commits into JuliaLang:master, Jul 2, 2025

Conversation

@mbauman (Member) commented Jun 23, 2025

Split off from #58785, this simplifies iterate and removes the @inbounds call that was added in #58635. It achieves the same (or better!) performance, however, by targeting optimizations in checkbounds and — in particular — the construction of a linear eachindex (against which the bounds are checked).
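The linear-eachindex idea at the heart of this can be seen directly at the REPL. A small illustration (the exact codegen details are version-dependent):

```julia
# For an IndexLinear array such as Array, eachindex is a simple linear
# range, so checkbounds reduces to an integer comparison that LLVM can
# track and hoist out of loops.
A = rand(4, 4)
eachindex(A)             # Base.OneTo(16) for an Array
checkbounds(Bool, A, 16) # true: 16 is within eachindex(A)
checkbounds(Bool, A, 17) # false: 17 is past the end
```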

@mbauman mbauman requested review from giordano and jishnub June 23, 2025 18:56
@mbauman mbauman added the arrays and iteration (Involves iteration or the iteration protocol) labels Jun 23, 2025
@giordano (Member)

It achieves the same (or better!) performance

Do you have any benchmarks handy?

@mbauman (Member, Author) commented Jun 23, 2025

Using @jishnub's benchmark:

Nightly @ f61c640 (which has the @inbounds):

julia> using BenchmarkTools
Precompiling BenchmarkTools finished.
  8 dependencies successfully precompiled in 12 seconds. 9 already precompiled.

julia> using LinearAlgebra

julia> A = rand(1000,1000); v2 = view(A, 1:2:lastindex(A));

julia> @btime norm(Iterators.map(splat(-), zip($v2, $v2)));
  1.941 ms (0 allocations: 0 bytes)

vs. b09514e:

julia> @btime norm(Iterators.map(splat(-), zip($v2, $v2)));
  312.167 μs (0 allocations: 0 bytes)

Comment on lines +1241 to +1247:

    @inline iterate(A::AbstractArray, state = iterate_starting_state(A)) = _iterate(A, state)
    @inline function _iterate(A::AbstractArray, state::Tuple)
        y = iterate(state...)
        y === nothing && return nothing
        A[y[1]], (state[1], tail(y)...)
    end
   -function _iterate(A::AbstractArray, state::Integer)
   +@inline function _iterate(A::AbstractArray, state::Integer)
Review comment (Member):

are these functions really so big that they aren't being inlined automatically?

@mbauman (Member, Author) replied Jun 23, 2025:

Before b09514e, yes, they were (well, kinda. length itself was not inlining; had I added @inline to it, though, it'd make this method not inline). That branch was enough to push things over the inlining limit. As I wrote in #58785 (comment),

this calls abstract infrastructure that could be large (and itself @inline'd)... which would then require these generic methods to be similarly @inline.
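For context, one way to check whether a small method actually inlines into its caller (illustrative; inlining cost thresholds vary across Julia versions):

```julia
# If f inlines into g, code_typed(g, ...) shows the raw add/mul
# instructions directly, rather than an invoke(f, ...) call.
f(x) = x + 1
g(x) = 2 * f(x)
code_typed(g, (Int,))
```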

@mbauman (Member, Author) commented Jun 23, 2025

@nanosoldier runbenchmarks()

@giordano (Member)

Test failures (quite a lot) look relevant 😢

@mbauman mbauman changed the title Don't @inbounds in AbstractArray's iterate method Don't @inbounds AbstractArray's iterate method; optimize checkbounds instead Jun 23, 2025
mbauman added 7 commits June 23, 2025 16:54. Two of the commit messages:
  • This is not valid because it only checks the resulting index into the parent (which we explicitly say we do not check) but skips the checks into the indices (which are the important ones!)
  • theoretically this is not guaranteed -- indeed an unreachable branch is not currently present for CodeUnits
@mbauman (Member, Author) commented Jun 25, 2025

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@JeffBezanson (Member)

Very good PR. Benchmarks are an interesting mixed bag of slowdowns and infinite speedups :) Not sure what to do about it.

@mbauman (Member, Author) commented Jun 26, 2025

The regressions that look meaningful to me are in the iteration of Dict, IdDict and BitSet — those are reporting to be on the order of 1.8x slower. It may be possible to address those; my hunch is that they're hitting the range-length change. That branch is surely constant-folded or statically known in most cases. Maybe adding @inline to that method is the better solution.

@mbauman (Member, Author) commented Jun 27, 2025

OK, I don't think any of the regressions are actually real. I can't reproduce the iteration ones locally, and BenchmarkTools is doing something wildly wrong for the in-place BitSet ones. Just looking at nightly:

julia> using Random, StableRNGs, BenchmarkTools

julia> const RNG = StableRNG(1)
StableRNGs.LehmerRNG(state=0x00000000000000000000000000000003)

julia> const iterlen = 1000;

julia> const ints = rand(RNG, 1:iterlen, iterlen);

julia> c = BitSet(ints);

julia> const newints = [rand(RNG, ints, 10); rand(RNG, 1:iterlen, 10); rand(RNG, iterlen:2iterlen, 10);];

julia> c2 = BitSet(newints);

julia> @benchmark union!(x, $c2) setup=(x=copy($c)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):   0.001 ns … 916.000 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     42.000 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   46.112 ns ±  33.456 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄          ▇█          ▂                       ▁▃            ▁
  █▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▇ █
  0.001 ns      Histogram: log(frequency) by time       208 ns <

 Memory estimate: 736 bytes, allocs estimate: 1.

That is... weird.

@nanosoldier runbenchmarks(!("scalar" || "dates" || "io" || "problem" || "inference"), vs = "master")

@mbauman (Member, Author) commented Jun 27, 2025

@nanosoldier runbenchmarks(["array","collection","find","misc","sort","sparse","tuple","union"], vs=":master")

@mbauman (Member, Author) commented Jun 27, 2025

🤷

@nanosoldier runbenchmarks(ALL, vs=":master")

@vtjnash (Member) commented Jun 27, 2025

I don't think nanosoldier knows how to distribute ! over ||, and [ means a single test that matches all of those tags in order

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@mbauman (Member, Author) commented Jun 30, 2025

That looks much better. I think there are only two (non-scalar) regressions over 1.1x that appear on both pages:

ID                                                         time ratio (1)  memory ratio (1)  time ratio (2)  memory ratio (2)
["array", "setindex!", ("setindex!", 3)]                   1.29 (5%) ❌    1.00 (1%)         1.17 (5%) ❌    1.00 (1%)
["tuple", "linear algebra", ("matvec", "(4, 4)", "(4,)")]  1.52 (5%) ❌    1.00 (1%)         1.13 (5%) ❌    1.00 (1%)

The perf improvements are stable across the two runs. This is looking good to me.

@mbauman (Member, Author) commented Jun 30, 2025

Just for posterity, as I understand it, there are three separate considerations — each of which helps increase the chance that the compiler does something smart here:

  • Ensuring length inlines (here by making a branch do math instead)
  • The iteration implementation here takes the form checkbounds(Bool, x, i) ? (x[i], i+1) : nothing... and x[i] itself should have the exact same checkbounds(Bool, x, i) predicate inside it. Making sure those two predicates match exactly increases the odds that the compiler skips the inner branch entirely. (That's what's behind the CodeUnits change)
  • Making sure the checkbounds test itself is linear (purely adds and muls) helps LLVM track repeated checkbounds inside a for loop and hoist it out entirely. (That's what moving to unchecked_oneto does)
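The first two points can be sketched with a toy version of the pattern (illustrative, not the exact Base implementation):

```julia
# Toy iterate in the checkbounds-predicated form described above. The
# checkbounds(Bool, x, i) test here is the very same predicate that
# getindex performs internally, so the compiler can merge the two
# branches and skip the inner bounds check entirely.
function myiterate(x::AbstractArray, i::Int = firstindex(x))
    checkbounds(Bool, x, i) ? (x[i], i + 1) : nothing
end

v = [10, 20, 30]
myiterate(v)     # (10, 2)
myiterate(v, 3)  # (30, 4)
myiterate(v, 4)  # nothing
```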

@giordano (Member) left a comment

I'm not deeply familiar with this part of the code, but the changes look sensible: the diff in base (excluding tests) is net negative, we get new tests, and performance is more consistently better. Looks like a win-win situation.

@mbauman (Member, Author) commented Jul 1, 2025

OK, given that the length changes here are relatively ancillary to the main objective around iteration, I've split those off into #58864, and instead I just mark it (and its callers) @inline for now. Given the fundamental importance of the length of a range, I figure that's prudent. Specifically targeting the inlining of this method (and its callers) gets us some of the advantages in the original benchmark. That benchmark now lands about half-way between the two timings I reported in #58793 (comment):

julia> using BenchmarkTools, LinearAlgebra

julia> A = rand(1000,1000); v2 = view(A, 1:2:lastindex(A));

julia> @btime norm(Iterators.map(splat(-), zip($v2, $v2)));
  931.917 μs (0 allocations: 0 bytes)

@mbauman mbauman merged commit e631972 into JuliaLang:master Jul 2, 2025
8 checks passed
@mbauman mbauman deleted the mb+mg/array-iteration branch July 2, 2025 13:03
Labels: arrays, iteration (Involves iteration or the iteration protocol), performance (Must go faster)