Speed up iteration with numbers #16687

Merged
merged 1 commit from teh/iter_number into master on Jun 4, 2016

Conversation

timholy (Member) commented May 31, 2016

While tracking down a puzzling performance regression with #16260, I discovered that LLVM is remarkably sensitive to how we define start, next, and done for iteration over scalars. Here's how I discovered the problem:

function iter_indexed(n, v)
    s = 0
    for i = 1:n
        for j = 1:1
            @inbounds k = v[j]
            s += k
        end
    end
    s
end

function iter_in(n, v)
    s = 0
    for i = 1:n
        for k in v
            s += k
        end
    end
    s
end

v = 3
iter_indexed(1, v)
iter_in(1, v)
@time 1
@time iter_indexed(10^8, v)
@time iter_in(10^8, v)

Results:

julia> include("/tmp/testloop.jl")
  0.000003 seconds (156 allocations: 9.278 KB)
  0.000002 seconds (6 allocations: 192 bytes)
  0.139031 seconds (6 allocations: 192 bytes)
300000000

Now, this dramatic difference simply indicates that LLVM elides the loops entirely for iter_indexed (reducing them to a single multiply), but not for iter_in:

julia> @code_llvm iter_indexed(10^5, 3)

define i64 @julia_iter_indexed_50578(i64, i64) #0 {
top:
  %2 = icmp slt i64 %0, 1
  br i1 %2, label %L7, label %if.lr.ph

if.lr.ph:                                         ; preds = %top
  %3 = mul i64 %1, %0
  br label %L7

L7:                                               ; preds = %if.lr.ph, %top
  %s.0.lcssa = phi i64 [ %3, %if.lr.ph ], [ 0, %top ]
  ret i64 %s.0.lcssa
}

julia> @code_llvm iter_in(10^5, 3)

define i64 @julia_iter_in_50579(i64, i64) #0 {
top:
  %"#temp#1.sroa.0" = alloca i8, align 1
  %2 = icmp slt i64 %0, 1
  br i1 %2, label %L5, label %if.lr.ph

if.lr.ph:                                         ; preds = %top
  %3 = bitcast i8* %"#temp#1.sroa.0" to i1*
  br label %if

L.loopexit.loopexit:                              ; preds = %if6
  br label %L.loopexit

L.loopexit:                                       ; preds = %L.loopexit.loopexit, %if
  %s.1.lcssa = phi i64 [ %s.010, %if ], [ %9, %L.loopexit.loopexit ]
  %4 = add i64 %"#temp#.09", 1
  %5 = icmp eq i64 %"#temp#.09", %0
  br i1 %5, label %L5.loopexit, label %if

L5.loopexit:                                      ; preds = %L.loopexit
  br label %L5

L5:                                               ; preds = %L5.loopexit, %top
  %s.0.lcssa = phi i64 [ 0, %top ], [ %s.1.lcssa, %L5.loopexit ]
  ret i64 %s.0.lcssa

if:                                               ; preds = %if.lr.ph, %L.loopexit
  %s.010 = phi i64 [ 0, %if.lr.ph ], [ %s.1.lcssa, %L.loopexit ]
  %"#temp#.09" = phi i64 [ 1, %if.lr.ph ], [ %4, %L.loopexit ]
  store i1 false, i1* %3, align 1
  %6 = load i8, i8* %"#temp#1.sroa.0", align 1
  %7 = and i8 %6, 1
  %8 = icmp eq i8 %7, 0
  br i1 %8, label %if6.preheader, label %L.loopexit

if6.preheader:                                    ; preds = %if
  br label %if6

if6:                                              ; preds = %if6.preheader, %if6
  %s.18 = phi i64 [ %9, %if6 ], [ %s.010, %if6.preheader ]
  store i1 true, i1* %3, align 1
  %9 = add i64 %s.18, %1
  %10 = load i8, i8* %"#temp#1.sroa.0", align 1
  %11 = and i8 %10, 1
  %12 = icmp eq i8 %11, 0
  br i1 %12, label %if6, label %L.loopexit.loopexit
}

Based on this, it's worth testing two ways of declaring iteration over a number:

module FixLoop

immutable Number1{T}
    val::T
end

immutable Number2{T}
    val::T
end

# Here's how master declares iteration over numbers now:
Base.start(::Number1) = false
Base.done(::Number1, state) = state
Base.next(n::Number1, state) = n.val, true

# This PR:
Base.start(::Number2) = 0
Base.done(::Number2, state) = state == 1
Base.next(n::Number2, state) = n.val, state+1

end

n1 = FixLoop.Number1(3)
n2 = FixLoop.Number2(3)
iter_in(1, n1)
iter_in(1, n2)
@time iter_in(10^8, n1)
@time iter_in(10^8, n2)

with results:

julia> include("/tmp/fixloop.jl")
WARNING: replacing module FixLoop
  0.121019 seconds (6 allocations: 192 bytes)
  0.000002 seconds (6 allocations: 192 bytes)
300000000
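
For reference, on Julia 0.5 a for loop lowers to the start/done/next protocol, so the state representation chosen above is exactly what LLVM ends up reasoning about. A rough sketch of what the inner loop of iter_in lowers to (illustrative, not the exact generated code):

state = start(v)               # false for Number1, 0 for Number2
while !done(v, state)          # Bool test vs. integer comparison against 1
    k, state = next(v, state)
    s += k
end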

Jutho (Contributor) commented May 31, 2016

Might this be a regression? It doesn't happen in v0.4.5.

vtjnash (Member) commented Jun 1, 2016

> Might this be a regression? It doesn't happen in v0.4.5.

Yes, this appears to be an LLVM regression, since it also doesn't happen with LLVM 3.3 on master.

timholy (Member, Author) commented Jun 1, 2016

How does one go about reporting such things upstream? I'm presuming that a few lines of Julia code that demonstrate the problem won't quite cut it. Since the code returned by @code_llvm appears to be something that has already passed through an optimizer, is there a good way to capture the initial input?

yuyichao (Contributor) commented Jun 1, 2016

Run it with -O0.

vtjnash (Member) commented Jun 1, 2016

Usually upstream wants a .ll file:
open("code.ll", "w") do io; code_llvm(io, f, args, #=strip=#false, #=module=#true); end

You can then test that outside of julia with llc (from julia/usr/bin) and look at the effects of various optimization levels (-O1/2/3), etc. on the resulting assembly and intermediate IR (-print-after-all).
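
For example, a rough version of that workflow (the file name is illustrative, and the session is started with julia -O0 per the suggestion above) could look like:

# inside a julia -O0 session, dump the whole module's IR for the slow case:
open("loop_slow.ll", "w") do io
    code_llvm(io, iter_in, Tuple{Int, Int}, #=strip=#false, #=module=#true)
end
# then, from the shell, lower it with llc and inspect each pass:
#   julia/usr/bin/llc -O2 -print-after-all loop_slow.ll 2> intermed_slow_2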

timholy (Member, Author) commented Jun 1, 2016

Thanks for the great tips. I wrote the .ll files using -O0 as a julia option. But in playing with llc, I could neither see any difference in the resulting assembly depending on the optimization level (above -O0), nor (assuming I'm reading this correctly) get the "fast" version to elide the loop.

Examples:

tim@diva:/tmp$ ~/src/julia-0.5/usr/bin/llc -O3 -o fast_3 -print-after-all loop_fast.ll 2>intermed_fast_3
tim@diva:/tmp$ ~/src/julia-0.5/usr/bin/llc -O1 -o fast_1 -print-after-all loop_fast.ll 2>intermed_fast_1
tim@diva:/tmp$ cmp fast_1 fast_3
tim@diva:/tmp$ cmp intermed_fast_1 intermed_fast_3

I posted a gist with intermed_fast_2 and intermed_slow_2 here

I'm sure I'm being a noob about this, so apologies in advance. But I'm also wondering: are we certain this is purely an LLVM issue, or is there some pass I have to turn on explicitly in llc?
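
One possibility worth trying (not attempted in this thread, and it assumes an opt binary from the matching LLVM build is on hand): llc mostly runs backend/codegen passes, so the mid-level loop optimizations may need to be exercised explicitly with opt before lowering with llc, e.g.:

# hypothetical commands; file names are illustrative
opt -O2 -S loop_slow.ll -o loop_slow_opt.ll                           # run the standard mid-level pipeline
opt -O2 -print-after-all loop_slow.ll -o /dev/null 2> intermed_opt    # watch each pass's effect on the IR
llc -O2 loop_slow_opt.ll -o slow_opt.s                                # then lower the optimized IR to assembly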

timholy force-pushed the teh/iter_number branch from 3228a79 to 526695c on June 3, 2016 20:58
timholy merged commit cd2a278 into master on Jun 4, 2016
timholy deleted the teh/iter_number branch on June 4, 2016 02:39
tkelman (Contributor) commented Jun 4, 2016

This needs to be debugged and isolated a bit more.

timholy (Member, Author) commented Jun 4, 2016

I agree, and I'll file an issue. But it's too separate an issue, with too simple a workaround, to let it derail my current task.

tkelman (Contributor) commented Jun 4, 2016

Yeah, that's fine. The workaround was simple in this particular case that you noticed, but how much other code might be affected by the same underlying problem?

timholy (Member, Author) commented Jun 4, 2016

I'm not disagreeing in the slightest (it's much of why I filed the issue).
