Speed up iteration with numbers #16687

Merged
merged 1 commit from teh/iter_number into master on Jun 4, 2016

Conversation

timholy (Member) commented May 31, 2016

While tracking down a puzzling performance regression with #16260, I discovered that LLVM is remarkably sensitive to how we define start, next, and done for iteration over scalars. Here's how I discovered the problem:

function iter_indexed(n, v)
    s = 0
    for i = 1:n
        for j = 1:1
            @inbounds k = v[j]
            s += k
        end
    end
    s
end

function iter_in(n, v)
    s = 0
    for i = 1:n
        for k in v
            s += k
        end
    end
    s
end

v = 3
iter_indexed(1, v)
iter_in(1, v)
@time 1
@time iter_indexed(10^8, v)
@time iter_in(10^8, v)

Results:

julia> include("/tmp/testloop.jl")
  0.000003 seconds (156 allocations: 9.278 KB)
  0.000002 seconds (6 allocations: 192 bytes)
  0.139031 seconds (6 allocations: 192 bytes)
300000000

Now, this dramatic difference simply indicates that LLVM elides the loops entirely for iter_indexed (reducing them to a single multiply), but not for iter_in:

julia> @code_llvm iter_indexed(10^5, 3)

define i64 @julia_iter_indexed_50578(i64, i64) #0 {
top:
  %2 = icmp slt i64 %0, 1
  br i1 %2, label %L7, label %if.lr.ph

if.lr.ph:                                         ; preds = %top
  %3 = mul i64 %1, %0
  br label %L7

L7:                                               ; preds = %if.lr.ph, %top
  %s.0.lcssa = phi i64 [ %3, %if.lr.ph ], [ 0, %top ]
  ret i64 %s.0.lcssa
}

julia> @code_llvm iter_in(10^5, 3)

define i64 @julia_iter_in_50579(i64, i64) #0 {
top:
  %"#temp#1.sroa.0" = alloca i8, align 1
  %2 = icmp slt i64 %0, 1
  br i1 %2, label %L5, label %if.lr.ph

if.lr.ph:                                         ; preds = %top
  %3 = bitcast i8* %"#temp#1.sroa.0" to i1*
  br label %if

L.loopexit.loopexit:                              ; preds = %if6
  br label %L.loopexit

L.loopexit:                                       ; preds = %L.loopexit.loopexit, %if
  %s.1.lcssa = phi i64 [ %s.010, %if ], [ %9, %L.loopexit.loopexit ]
  %4 = add i64 %"#temp#.09", 1
  %5 = icmp eq i64 %"#temp#.09", %0
  br i1 %5, label %L5.loopexit, label %if

L5.loopexit:                                      ; preds = %L.loopexit
  br label %L5

L5:                                               ; preds = %L5.loopexit, %top
  %s.0.lcssa = phi i64 [ 0, %top ], [ %s.1.lcssa, %L5.loopexit ]
  ret i64 %s.0.lcssa

if:                                               ; preds = %if.lr.ph, %L.loopexit
  %s.010 = phi i64 [ 0, %if.lr.ph ], [ %s.1.lcssa, %L.loopexit ]
  %"#temp#.09" = phi i64 [ 1, %if.lr.ph ], [ %4, %L.loopexit ]
  store i1 false, i1* %3, align 1
  %6 = load i8, i8* %"#temp#1.sroa.0", align 1
  %7 = and i8 %6, 1
  %8 = icmp eq i8 %7, 0
  br i1 %8, label %if6.preheader, label %L.loopexit

if6.preheader:                                    ; preds = %if
  br label %if6

if6:                                              ; preds = %if6.preheader, %if6
  %s.18 = phi i64 [ %9, %if6 ], [ %s.010, %if6.preheader ]
  store i1 true, i1* %3, align 1
  %9 = add i64 %s.18, %1
  %10 = load i8, i8* %"#temp#1.sroa.0", align 1
  %11 = and i8 %10, 1
  %12 = icmp eq i8 %11, 0
  br i1 %12, label %if6, label %L.loopexit.loopexit
}

Based on this, it's worth testing two ways of declaring iteration over a number:

module FixLoop

immutable Number1{T}
    val::T
end

immutable Number2{T}
    val::T
end

# Here's how master declares iteration over numbers now:
Base.start(::Number1) = false
Base.done(::Number1, state) = state
Base.next(n::Number1, state) = n.val, true

# This PR:
Base.start(::Number2) = 0
Base.done(::Number2, state) = state == 1
Base.next(n::Number2, state) = n.val, state+1

end

n1 = FixLoop.Number1(3)
n2 = FixLoop.Number2(3)
iter_in(1, n1)
iter_in(1, n2)
@time iter_in(10^8, n1)
@time iter_in(10^8, n2)

with results:

julia> include("/tmp/fixloop.jl")
WARNING: replacing module FixLoop
  0.121019 seconds (6 allocations: 192 bytes)
  0.000002 seconds (6 allocations: 192 bytes)
300000000
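
For reference, on Julia 0.5 a for loop lowers to the start/done/next protocol, so the state representation chosen above is exactly what LLVM ends up reasoning about. A rough sketch of what the inner loop of iter_in lowers to (illustrative, not the exact generated code):

state = start(v)               # false for Number1, 0 for Number2
while !done(v, state)          # Bool test vs. integer comparison against 1
    k, state = next(v, state)
    s += k
end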

Jutho (Contributor) commented May 31, 2016

Might this be a regression? It doesn't happen in v0.4.5.

vtjnash (Member) commented Jun 1, 2016

> Might this be a regression? It doesn't happen in v0.4.5.

Yes, this appears to be an LLVM regression, since it also doesn't happen with LLVM 3.3 on master.

timholy (Member, Author) commented Jun 1, 2016

How does one go about reporting such things upstream? I'm presuming that a few lines of Julia code that demonstrate the problem won't quite cut it. Since the code returned by @code_llvm appears to be something that has already passed through an optimizer, is there a good way to capture the initial input?

yuyichao (Contributor) commented Jun 1, 2016

Run it with -O0.

vtjnash (Member) commented Jun 1, 2016

Usually upstream wants a .ll file:
open("code.ll", "w") do io; code_llvm(io, f, args, #=strip=#false, #=module=#true); end

You can then test that outside of julia with llc (from julia/usr/bin) and look at the effects of various optimization levels (-O1/2/3), etc. on the resulting assembly and intermediate IR (-print-after-all).
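
For example, a rough version of that workflow (the file name is illustrative, and the session is started with julia -O0 per the suggestion above) could look like:

# inside a julia -O0 session, dump the whole module's IR for the slow case:
open("loop_slow.ll", "w") do io
    code_llvm(io, iter_in, Tuple{Int, Int}, #=strip=#false, #=module=#true)
end
# then, from the shell, lower it with llc and inspect each pass:
#   julia/usr/bin/llc -O2 -print-after-all loop_slow.ll 2> intermed_slow_2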

timholy (Member, Author) commented Jun 1, 2016

Thanks for the great tips. I wrote the .ll files using -O0 as a julia option. But in playing with llc, I could neither see any difference in the resulting assembly depending on the optimization level (above -O0), nor (assuming I'm reading this correctly) get the "fast" version to elide the loop.

Examples:

tim@diva:/tmp$ ~/src/julia-0.5/usr/bin/llc -O3 -o fast_3 -print-after-all loop_fast.ll 2>intermed_fast_3
tim@diva:/tmp$ ~/src/julia-0.5/usr/bin/llc -O1 -o fast_1 -print-after-all loop_fast.ll 2>intermed_fast_1
tim@diva:/tmp$ cmp fast_1 fast_3
tim@diva:/tmp$ cmp intermed_fast_1 intermed_fast_3

I posted a gist with intermed_fast_2 and intermed_slow_2 here

I'm sure I'm being a noob about this, so apologies in advance. But I'm also wondering: are we certain this is purely an LLVM issue, or is there some pass I have to turn on explicitly in llc?
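
One possibility worth trying (not attempted in this thread, and it assumes an opt binary from the matching LLVM build is on hand): llc mostly runs backend/codegen passes, so the mid-level loop optimizations may need to be exercised explicitly with opt before lowering with llc, e.g.:

# hypothetical commands; file names are illustrative
opt -O2 -S loop_slow.ll -o loop_slow_opt.ll                           # run the standard mid-level pipeline
opt -O2 -print-after-all loop_slow.ll -o /dev/null 2> intermed_opt    # watch each pass's effect on the IR
llc -O2 loop_slow_opt.ll -o slow_opt.s                                # then lower the optimized IR to assembly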

timholy force-pushed the teh/iter_number branch from 3228a79 to 526695c on June 3, 2016 20:58
timholy merged commit cd2a278 into master on Jun 4, 2016
timholy deleted the teh/iter_number branch on June 4, 2016 02:39
tkelman (Contributor) commented Jun 4, 2016

This needs to be debugged and isolated a bit more.

timholy (Member, Author) commented Jun 4, 2016

I agree, and I'll file an issue. But it's too separate an issue, with too simple a workaround, to let it derail my current task.

tkelman (Contributor) commented Jun 4, 2016

Yeah, that's fine. The workaround was simple in this particular case that you noticed, but how much other code might be affected by the same underlying problem?

timholy (Member, Author) commented Jun 4, 2016

I'm not disagreeing in the slightest (it's much of why I filed the issue).
