Manually initialize GcBox contents post-allocation to reduce memory copying #14
Uhh... don't think the test failure is related to my PR?
Sorry I've been so behind checking on my PRs, I'm trying to catch up. First of all, sorry for taking so long on this, this is a really good catch and I'm happy that somebody is looking out for these sorts of things. This is a huge improvement! I only have one suggestion and one half suggestion / half question. The first suggestion is easy: we need a comment explaining why we're initializing the struct this way as opposed to the obvious thing. The second one is more of a question, what happens if we just initialize the
It's possible I'm making too big a deal about it, but I really get scared at initializing the individual fields of a struct with
(So by the time I'd actually gotten around to it, something interesting seems to have happened. On the latest Rust nightly, this godbolt example https://godbolt.org/z/aaK75W looks a lot different and seems to generate pretty good code for both the old and new versions? I know that the optimizations were delicate in the first place, and Rust 1.51 still shows the issue, so I'm going to do the following test using Rust 1.51.)
https://godbolt.org/z/fGWbq6vWh
It looks like in godbolt, if we do this and compare the two results, the full struct initialization is actually maybe very slightly better? I haven't tested this using actual ruffle though, so it's possible that the optimizations that lead to this are very delicate and the per-field initialization is better across more compilers, but I figured I should mention it. I'm much less uneasy about this version. It also looks slightly better under wasm32, but I'm really only skimming the x86_64 asm / wasm32 output, so it's very possible I'm missing something.
Anyway, whichever way, I'm sure you've thought about this much more than I have, so I'm fine with merging either version as long as there's at least a short comment about it.
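For readers following along, here is a minimal sketch of what "full struct initialization" after allocation might look like: allocate uninitialized memory first, then write the whole struct through the pointer in one step. The `GcBox` layout and the `allocate_whole` name are hypothetical stand-ins for illustration, not gc-arena's actual definitions, and this is not necessarily what the linked godbolt example or the later commit does.

```rust
use std::alloc::{alloc, handle_alloc_error, Layout};

// Hypothetical stand-in for gc-arena's GcBox; the real struct differs.
struct GcBox<T> {
    flags: u32,  // assumed header field
    next: usize, // assumed header field
    value: T,
}

/// Allocate uninitialized memory, then write the complete struct at once.
fn allocate_whole<T>(t: T) -> *mut GcBox<T> {
    let layout = Layout::new::<GcBox<T>>();
    unsafe {
        // The layout is never zero-sized because of the header fields.
        let ptr = alloc(layout) as *mut GcBox<T>;
        if ptr.is_null() {
            handle_alloc_error(layout);
        }
        // A single write of the whole struct; the optimizer is free to
        // construct it directly at the destination.
        ptr.write(GcBox { flags: 0, next: 0, value: t });
        ptr
    }
}

fn main() {
    let p = allocate_whole([3u8; 128]);
    unsafe {
        assert_eq!((*p).flags, 0);
        assert_eq!((*p).next, 0);
        assert_eq!((*p).value, [3u8; 128]);
        // The memory came from the global allocator with the layout of
        // GcBox<T>, so Box can take ownership and free it.
        drop(Box::from_raw(p));
    }
}
```

Whether this compiles down to fewer copies than writing each field separately depends on the compiler version, which is the whole point of the comparison above.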
This is a slightly modified version of PR #14
I've maybe addressed this in 68bb1ce, but if that still generates extremely sub-optimal code for anyone, feel free to open another PR with the original technique. I don't think this is as pressing of an issue for ruffle anyway because I believe ruffle is now using a gc-arena fork? So in that case I'm going to go ahead and close this PR.
Hey, sorry for not responding. I didn't get to come back to this PR yet :(
Not currently, no. Anyway, I'll try to come back to this soon, check if your commit helped and - if needed - make a new PR.
Wait is that "no it's not currently a pressing issue" or "no ruffle is not yet using a gc-arena fork"?
Trust me, it's okay, it took me plenty long to get to this PR in the first place!
Both? :)
Okay, that's good to know, I thought that had already happened. I'll definitely try to stay on top of this crate to the extent I can in the meantime, and the offer is still open for other things like co-maintainership or ownership transfer to ruffle-rs.
Actually, I was wrong, I didn't notice it was already forked :( Anyway, I tested your new changes. (btw, you can now mostly ignore the original code dumps from the PR, most are from before Rust 1.52, which significantly improved codegen here) Before the topic of If I were to guess, it's because Back to
You can observe the same when you remove
BTW, if you think that instead of messing with this in your repo, this should rather be considered a rustc/llvm bug, I can report this to rustc instead. (I guess I should report it either way?)
EDIT: rust-lang/rust#85094
Ideally, when calling
You'd assume `data` to be constructed in-place or moved into newly allocated memory (the first one isn't really common, as stable Rust lacks placement-new-like features). And with the struct being relatively big, you'd expect the compiler to generate a `memcpy` call to simply move the structure's bytes into place.
The issue currently is that, due to either rustc not being smart enough or the gc-arena code not being optimizer-friendly (or both), the compiler can `memcpy` your Data object several times before actually moving it into its final place.
For example here:
The generated code will first do a `memcpy` to move `t` into the `gc_box` object on the stack, then allocate memory, and then do a second `memcpy` to move the `gc_box` object onto heap memory. For some reason, on the wasm target the compiler is even worse at optimizing this; in the worst case, I've seen four `memcpy` calls for a single GC allocation. This can obviously cause unnecessary overhead.
My patch helps the compiler by simplifying the initialization: first we allocate the uninitialized memory, then we manually build the `GcBox` by moving its fields into place. This way the object `t` is moved straight into its final place without being moved into the intermediate stack variable `gc_box`.
I was trying to show a comparison on godbolt, but as soon as I drop some layers of abstraction, rustc catches on and generates better code. This is my best attempt: https://godbolt.org/z/aaK75W . You can see that in `old()` there is one `memcpy` before allocation and one after, but in `new()` there is only one `memcpy`.
Here's a comparison on "production" code, with a decompiled wasm build of https://github.com/ruffle-rs/ruffle/ . In practice, I've seen this cause up to 15-20% speedups in some edge cases.
Before, 4x `memcpy`:
After, just two:
And when rust-lang/rust#82806 gets merged into rustc, with my patch it'll become just one, how it's supposed to work :)
I made sure the patch passes tests with miri.