
Conversation

adrian17
Contributor

Ideally, when calling

let data = GcCell::allocate(gc_context, data); // assume  struct Data { x: [u8; 1000] }

you'd expect data to be either constructed in place or moved into the newly allocated memory (the former isn't really possible, as stable Rust lacks placement-new-like features). And with the struct being relatively big, you'd expect the compiler to generate a single memcpy call to move the structure's bytes into place.

The issue is that currently, due to either rustc not being smart enough or the gc-arena code not being optimizer-friendly (or both), the compiler can memcpy your Data object several times before actually moving it into its final place.

For example here:

        let gc_box = GcBox {
            flags: GcFlags::new(),
            next: Cell::new(self.all.get()),
            value: UnsafeCell::new(t),
        };
        gc_box.flags.set_needs_trace(T::needs_trace());
        let ptr = NonNull::new_unchecked(Box::into_raw(Box::new(gc_box)));

The generated code first does a memcpy to move t into the gc_box object on the stack, then allocates memory, and then does a second memcpy to move the gc_box object into the heap memory. For some reason, the compiler is even worse at optimizing this on the wasm target; in the worst case, I've seen four memcpy calls for a single GC allocation. This obviously causes unnecessary overhead.

My patch helps the compiler by simplifying the initialization: first we allocate the uninitialized memory, then we manually build the GcBox by writing its fields into place. This way the object t is moved straight into its final location without passing through the intermediate stack variable gc_box.
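As a rough sketch of the idea (with simplified stand-in types and a made-up flag layout; this is not the exact gc-arena code):

```rust
use std::alloc::{alloc, handle_alloc_error, Layout};
use std::cell::{Cell, UnsafeCell};
use std::ptr::{addr_of_mut, NonNull};

// Simplified stand-ins for gc-arena's types; the field contents here are
// hypothetical and only illustrate the initialization pattern.
struct GcFlags(Cell<u8>);
impl GcFlags {
    fn new() -> Self {
        GcFlags(Cell::new(0))
    }
    fn set_needs_trace(&self, needs_trace: bool) {
        self.0.set(needs_trace as u8);
    }
}

struct GcBox<T> {
    flags: GcFlags,
    next: Cell<*const ()>,
    value: UnsafeCell<T>,
}

// Allocate the heap memory first, then write each field directly into its
// final place; `t` is moved straight into the heap without an intermediate
// `gc_box` stack variable.
unsafe fn allocate<T>(t: T) -> NonNull<GcBox<T>> {
    let layout = Layout::new::<GcBox<T>>();
    let p = alloc(layout) as *mut GcBox<T>;
    if p.is_null() {
        handle_alloc_error(layout);
    }
    addr_of_mut!((*p).flags).write(GcFlags::new());
    addr_of_mut!((*p).next).write(Cell::new(std::ptr::null()));
    addr_of_mut!((*p).value).write(UnsafeCell::new(t));
    (*p).flags.set_needs_trace(true);
    NonNull::new_unchecked(p)
}

fn main() {
    unsafe {
        let p = allocate([7u8; 1000]);
        // Reclaim the allocation so the test cleans up after itself.
        let boxed = Box::from_raw(p.as_ptr());
        assert_eq!((*boxed.value.get())[0], 7);
        assert_eq!(boxed.flags.0.get(), 1);
    }
}
```

(`addr_of_mut!` is stable since Rust 1.51, so this pattern works on the compiler versions discussed below.)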

I tried to show a comparison on godbolt, but as soon as I drop some layers of abstraction, rustc catches on and generates better code. This is my best attempt: https://godbolt.org/z/aaK75W . You can see that in old() there is one memcpy before the allocation and one after, while in new() there is only one memcpy.

Here's a comparison on "production" code, with a decompiled wasm build of https://github.com/ruffle-rs/ruffle/ . In practice, I've seen this cause up to 15-20% speedups in some edge cases.

Before, 4x memcpy:

function gc_arena_context_MutationContext_allocate_h107c2f641d37f0af(a:int, b:int):int {
  var d:int;
  var e:int;
  var c:int = g_a - 368;
  g_a = c;
  memcpy(c + 16, b, 108);
  a[8]:int = (b = a[8]:int + 120);
  if (a[100]:ubyte == 3) {
    if (b <= a[10]:int) goto B_a;
    a[100]:byte = 0;
  }
  a[3]:double = a[3]:double + 120.0 + 120.0 / a[1]:double;
  label B_a:
  c[136]:byte = 0;
  c[16]:long = a[11]:long@4;
  memcpy(c + 140, c + 16, 108);
  gc_arena_types_GcFlags_set_needs_trace_h2ec31421f45c37eb(c + 136, 1); // No clue why it wasn't inlined
  memcpy(c + 248, c + 128, 120);
  b = rust_alloc(120, 4);
  if (b) {
    swf_string_SwfStr_as_core_convert_From_str_from_h2a696e0da16f901c( // <- this is actually `static_gc_box()`, compiler deduplicated functions. No clue why it wasn't inlined.
      c + 8,
      d = memcpy(b, c + 248, 120),
      1231000);
    b = c[2]:int;
    (a + 48)[0]:int = (e = c[3]:int);
    a[11]:int = b;
    if (a[100]:ubyte != 2) goto B_d;
    if (a[15]:int) goto B_d;
    a[16]:int = e;
    a[15]:int = b;
    label B_d:
    g_a = c + 368;
    return d;
  }
  alloc_alloc_handle_alloc_error_h9b35ff53f78b3b72(120, 4);
  return unreachable;
}

After, just two:

function gc_arena_context_MutationContext_allocate_h107c2f641d37f0af(a:int, b:long_ptr@4):int {
  var d:int;
  var c:long_ptr@4 = g_a - 240;
  g_a = c;
  memcpy(c + 8, b, 108);
  a[8]:int = (d = a[8]:int + 120);
  var e:int = 3;
  b = a[100]:ubyte;
  if (b == 3) {
    if (d <= a[10]:int) goto B_a;
    a[100]:byte = 0;
    b = 0;
  }
  a[3]:double = a[3]:double + 120.0 + 120.0 / a[1]:double;
  e = b;
  label B_a:
  b = rust_alloc(120, 4);
  if (b) {
    b[0] = c[30];
    d = b + 8;
    d[0]:int = (c + 128)[0]:int;
    d[0]:byte = 4;
    b[0] = a[11]:long@4;
    memcpy(b + 12, c + 8, 108);
    (a + 48)[0]:int = 1210328;
    a[11]:int = b;
    if (e != 2) goto B_d;
    if (a[15]:int) goto B_d;
    a[16]:int = 1210328;
    a[15]:int = b;
    label B_d:
    g_a = c + 240;
    return b;
  }
  alloc_alloc_handle_alloc_error_h9b35ff53f78b3b72(120, 4);
  return unreachable;
}

And when rust-lang/rust#82806 gets merged into rustc, with my patch it'll become just one, the way it's supposed to work :)

function gc_arena_context_MutationContext_allocate_h107c2f641d37f0af(a:int, b:int):int {
  var e:int;
  var c:int;
  a[8]:int = (e = a[8]:int + 120);
  var d:int = 3;
  c = a[100]:ubyte;
  if (c == 3) {
    if (e <= a[10]:int) goto B_a;
    a[100]:byte = 0;
    c = 0;
  }
  a[3]:double = a[3]:double + 120.0 + 120.0 / a[1]:double;
  d = c;
  label B_a:
  c = rust_alloc(120, 4);
  if (c) {
    c[8]:byte = 4;
    c[0]:long@4 = a[11]:long@4;
    memcpy(c + 12, b, 108);
    (a + 48)[0]:int = 1210328;
    a[11]:int = c;
    if (d != 2) goto B_d;
    if (a[15]:int) goto B_d;
    a[16]:int = 1210328;
    a[15]:int = c;
    label B_d:
    return c;
  }
  alloc_alloc_handle_alloc_error_h9b35ff53f78b3b72(120, 4);
  return unreachable;
}

I made sure the patch passes tests with miri.

@adrian17 force-pushed the less-memcpy-in-allocate branch from c861762 to 05def4b on March 10, 2021
@adrian17
Contributor Author

Uhh... I don't think the test failure is related to my PR?

@kyren
Owner

kyren commented Apr 26, 2021

Sorry I've been so behind on checking my PRs; I'm trying to catch up.

First of all, sorry for taking so long on this, this is a really good catch and I'm happy that somebody is looking out for these sorts of things. This is a huge improvement!

I only have one suggestion and one half-suggestion/half-question. The first suggestion is easy: we need a comment explaining why we're initializing the struct this way, as opposed to the obvious way.

The second one is more of a question: what happens if we just initialize the GcBox pointer like this?

        std::ptr::write(uninitialized.as_mut_ptr(), GcBox {
            flags,
            next: Cell::new(self.all.get()),
            value: UnsafeCell::new(t)
        });

It's possible I'm making too big a deal about it, but initializing the individual fields of a struct with ptr::write really scares me, since it would be so easy to change the structure and forget to fully initialize it. Maybe this is a stupid fear, because the garbage collection algorithm is unlikely to drastically change? I'm not sure, but it makes me uncomfortable.

(By the time I actually got around to this, something interesting seems to have happened: on the latest Rust nightly, the godbolt example https://godbolt.org/z/aaK75W looks a lot different and seems to generate pretty good code for both the old and new versions. I know the optimizations were delicate in the first place, and Rust 1.51 still shows the issue, so I'm going to do the following test using Rust 1.51.)

https://godbolt.org/z/fGWbq6vWh

It looks like on godbolt, if we do this and compare the two results, the full-struct initialization is actually maybe very slightly better? I haven't tested this with actual ruffle, though, so it's possible that the optimizations leading to this are very delicate and that the per-field initialization is better across more compilers, but I figured I should mention it. I'm much less uneasy about this version. It also looks slightly better under wasm32, but I'm really only skimming the x86_64 asm / wasm32 output, so it's very possible I'm missing something.

Anyway, whichever way, I'm sure you've thought about this much more than I have, so I'm fine with merging either version as long as there's at least a short comment about it.

kyren added a commit that referenced this pull request May 4, 2021
This is a slightly modified version of PR #14
@kyren
Owner

kyren commented May 4, 2021

I've maybe addressed this in 68bb1ce, but if that still generates extremely sub-optimal code for anyone, feel free to open another PR with the original technique.

I don't think this is as pressing an issue for ruffle anyway, because I believe ruffle is now using a gc-arena fork? So in that case I'm going to go ahead and close this PR.

@kyren kyren closed this May 4, 2021
@adrian17
Contributor Author

adrian17 commented May 4, 2021

Hey, sorry for not responding. I haven't had a chance to come back to this PR yet :(
One issue I had was that the code I was seeing in built binaries was sometimes worse than the minimal reproductions on godbolt, which made it harder to analyze and explain in comments.

I don't think this is as pressing of an issue for ruffle anyway because I believe ruffle is now using a gc-arena fork?

Not currently, no.

Anyway, I'll try to come back to this soon, check if your commit helped and - if needed - make a new PR.

@kyren
Owner

kyren commented May 4, 2021

I don't think this is as pressing of an issue for ruffle anyway because I believe ruffle is now using a gc-arena fork?

Not currently, no.

Wait is that "no it's not currently a pressing issue" or "no ruffle is not yet using a gc-arena fork"?

Hey, sorry for not responding. I didn't get to come back to this PR yet :(

Trust me, it's okay, it took me plenty long to get to this PR in the first place!

@adrian17
Contributor Author

adrian17 commented May 4, 2021

Wait is that "no it's not currently a pressing issue" or "no ruffle is not yet using a gc-arena fork"?

Both? :)
The general perf issue is a random thing I noticed that measurably impacts only a small number of games so far, so it's not a pressing issue (I also made this PR to possibly help other gc-arena users).
And Ruffle hasn't forked gc-arena yet; longer term we might switch to a fork (or another library), when the time comes to implement more difficult features like weak references.

@kyren
Owner

kyren commented May 4, 2021

And Ruffle hasn't forked gc-arena yet; longer term we might switch to a fork (or another library), when the time comes to implement more difficult features like weak references.

Okay that's good to know, I thought that had already happened. I'll definitely try to stay on top of this crate to the extent I can then in the meantime, and the offer is still open for other things like co maintainership or ownership transfer to ruffle-rs.

@adrian17
Contributor Author

adrian17 commented May 7, 2021

Actually, I was wrong, I didn't notice it was already forked :(

Anyway, I tested your new changes. (By the way, you can now mostly ignore the original code dumps from the PR; most of them are from before Rust 1.52, which significantly improved codegen here.)

Before getting back to memcpy: there's a weird inlining issue that happened before and that I still only barely understand. Our builds on both x64 and wasm do not inline trivial functions like GcFlags::set_needs_trace and static_gc_box. This is the same as in the original analysis. Example: https://gist.github.com/adrian17/473242920816c9c55ab0af16706da867

If I were to guess, it's because MutationContext::allocate gets compiled during the ruffle_core build (as it's a generic instantiation), while GcFlags::set_needs_trace and static_gc_box are not generic, so they were compiled into gc-arena, and there's no implicit cross-crate inlining without LTO. This is why my PR added #[inline] to these functions.
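For illustration, the fix is roughly this shape (a sketch, not the crate's actual code; the flag bit position used here is hypothetical):

```rust
use std::cell::Cell;

pub struct GcFlags(Cell<u8>);

impl GcFlags {
    // Without `#[inline]`, rustc does not export the body of a non-generic
    // function to downstream crates, so a call from monomorphized generic
    // code compiled in the user's crate cannot be inlined unless LTO is
    // enabled. `#[inline]` makes the body available across the crate
    // boundary. (The bit position below is made up for this sketch.)
    #[inline]
    pub fn set_needs_trace(&self, needs_trace: bool) {
        let flags = self.0.get();
        self.0.set((flags & !0b100) | ((needs_trace as u8) << 2));
    }
}

fn main() {
    let flags = GcFlags(Cell::new(0));
    flags.set_needs_trace(true);
    assert_eq!(flags.0.get() & 0b100, 0b100);
}
```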

Back to memcpy: your simpler optimization unfortunately doesn't seem to have helped. It's kind of hard to make a comparable reproduction on godbolt (as I can't, e.g., reproduce the cross-crate inlining issues), but I think I managed to make a good illustration:

  • https://godbolt.org/z/EbaoPY6ae : The original code before my and your changes. You can see MutationContext::allocate calls memcpy twice (you can ignore the third one in f), once before and once after the allocation via rust_alloc;
  • https://godbolt.org/z/Ye7r6efzM : Your change from 4 days ago. There are still two memcpy, the only difference being that they now both occur after the allocation;
  • https://godbolt.org/z/oYrez1c97 : Version from my PR. There's a single memcpy, and there's no huge stack on x86-64 (as in, no sub rsp, 1040).

You can observe the same when you remove --target wasm32-unknown-unknown to build for x86-64. (The second link, with your optimization, looks like it does a single memcpy in MutationContext::allocate(), but for whatever reason LLVM decided to generate mov r12, qword ptr [rip + memcpy@GOTPCREL]; call r12; (...) call r12;, so memcpy is really called twice.)

BTW, if you think that, instead of working around this in your repo, this should rather be treated as a rustc/LLVM bug, I can report it to rustc instead. (I guess I should report it either way?)

EDIT: rust-lang/rust#85094
