vulkan: conv2d addressing optimizations #15056
base: master
Conversation
- Rather than clamping the coordinate and always doing the load, conditionally load the value.
- Move some invariant calculations outside of the loop.
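A minimal sketch of the two addressing strategies being compared (hypothetical C, not the actual GLSL shader code; `src`, `H`, `W`, and the helper names are illustrative):

```c
// Before: clamp the coordinate, always load, then select the result.
static float load_clamped(const float *src, int h, int w, int H, int W) {
    int hc = h < 0 ? 0 : (h >= H ? H - 1 : h);
    int wc = w < 0 ? 0 : (w >= W ? W - 1 : w);
    float v = src[hc * W + wc];                       // unconditional load
    int in_bounds = (h >= 0 && h < H && w >= 0 && w < W);
    return in_bounds ? v : 0.0f;                      // select
}

// After: only load when in bounds; the comments below note that on NVIDIA
// this compiles to a predicated load instead of load + select.
static float load_conditional(const float *src, int h, int w, int H, int W) {
    if (h >= 0 && h < H && w >= 0 && w < W) {
        return src[h * W + w];
    }
    return 0.0f;
}
```

Both variants return the same values; the difference is purely in the generated code.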
I already experimented with moving the invariants outside the loop, but I observed no improvement on GTX 1060. I assumed this is because the compiler can easily recognize the instruction dependencies and moves these instructions outside the loop itself, so I decided to keep them inside for better readability. It is surprising to me that it helps on more recent devices. Shouldn't we assume that the compiler does this? I intentionally introduced clamping (although it is unintuitive) because I observed that branch divergence really hurts performance on older devices; clamping minimizes the instructions that need to go into the branches, and I observed some gain compared to conditional loads. This might not hold on recent devices. I am curious whether your modifications help on older devices or not; I will test them.
I was surprised moving the invariants out of the loop helped, but it seemed to. Putting the load in the branch generates much better code for NVIDIA, doing a predicated load rather than the load + select.
I made a similar change some time ago, moving invariants outside the loops in the im2col shader, and there was a measurable improvement on my AMD GPUs, so I guess the compiler doesn't optimize that automatically.
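For reference, the loop-invariant hoisting being discussed looks roughly like this (hypothetical C sketch; the index expression and names are illustrative, not the real shader code):

```c
// Before: the base offset is recomputed on every iteration even though it
// does not depend on the loop variable k.
static float sum_unhoisted(const float *src, int n, int c,
                           int C, int H, int W, int K) {
    float acc = 0.0f;
    for (int k = 0; k < K; k++) {
        int base = n * C * H * W + c * H * W;  // invariant in k
        acc += src[base + k];
    }
    return acc;
}

// After: hoist the invariant computation out of the loop.
static float sum_hoisted(const float *src, int n, int c,
                         int C, int H, int W, int K) {
    const int base = n * C * H * W + c * H * W;  // computed once
    float acc = 0.0f;
    for (int k = 0; k < K; k++) {
        acc += src[base + k];
    }
    return acc;
}
```

The comments above suggest that, at least for some shader compilers, this hoisting is not always done automatically.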
Performance is similar to previous optimization on 1060:
I went back and double-checked that hoisting the invariant instructions helps: the shader has about 3% fewer instructions with that part of the change. Pascal has so much slower global loads that a few math instructions aren't going to move the needle.
You mean fewer instructions in the generated code or fewer executed instructions?
That's great. The optimizations in this PR are good for recent NVIDIA devices, at the cost of more complexity in the host code and sometimes more pipelines. It might be useful in the future to develop infrastructure to autotune the free parameters of the kernel on each device and cache the best parameters.
Yeah, my RX 470 is not happy with this change, though it's funny because from looking at the code I'd expect it to be faster. PR:
Master:
Now if I reverse the if/else it becomes a little bit faster than master. I'm too lazy to run this through a profiler, but I think the no-load condition happens so rarely that any improvement from it is not worth it, considering how it likely affects prefetching and compiler optimizations.
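The branch-order swap described above amounts to something like this (hypothetical C sketch with illustrative names; the two variants are semantically identical, the difference is which case the compiler lays out on the fall-through path):

```c
// Variant A (as in the PR): in-bounds load first, rare zero case second.
static float load_in_bounds_first(const float *src, int idx, int n) {
    if (idx >= 0 && idx < n) {
        return src[idx];
    }
    return 0.0f;
}

// Variant B (reversed condition): rare out-of-bounds case tested first.
// Per the comment above this reportedly ran slightly faster on RX 470.
static float load_out_of_bounds_first(const float *src, int idx, int n) {
    if (idx < 0 || idx >= n) {
        return 0.0f;
    }
    return src[idx];
}
```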
@netrunnereve Do you have a working profiler for AMD? I'm trying to properly learn and apply GPU profiling. I got it to work on NVIDIA, but not yet on AMD, at least not at the shader-code level.
If there's no easy way to resolve the AMD/Intel perf, I'm OK with abandoning this. |
Why would it be better to drop this PR instead of enabling it conditionally on Nvidia devices? |
Just not sure it's worth the additional complexity. |
Yep, RGP supports RADV if you have a new enough Mesa. I run it with something like this: Make sure your /tmp directory is big enough, as that's where all the traces go, and run it on a single test only to keep things under control. I've never been able to get it to work properly on a real model. Then you can open the trace with RGP; use version 1 for GCN and version 2 for the newer stuff. The wave and event pages work fine in most cases, but the instruction timings are often blank as it complains about not finding enough waves. Sometimes regenerating the profile will work, but there are some shaders, like the conv2d ones, which it just straight up doesn't like. I don't particularly care about the latencies for regular instructions, but I look for long waits for memory, the hit count for branches, and the register counts on the side. For actual instruction cycles I use the tables here.
Is it still faster on Nvidia if you remove the conditional loads? |
I think hoisting the invariants still shows a benefit even with the clamp+select still in place. |
@jeffbolznv I think we can also introduce a variant that assumes the contiguity of the filter. Maybe the math is so fast with coopmats that even the fastdivs can have a relatively big impact on overall performance. I observed significant gains with this before the shuffles; it might be worth a try on a 4090/5090.
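"Fastdiv" here refers to replacing a runtime division by a loop-invariant divisor with a precomputed multiply-high plus shift. A minimal host-side sketch of that technique (hypothetical C in the Granlund–Montgomery style, not the actual ggml shader code; names are illustrative):

```c
#include <stdint.h>

typedef struct {
    uint32_t m;  // precomputed magic multiplier
    uint32_t l;  // precomputed shift, ceil(log2(d))
} fastdiv_t;

// Precompute the constants once on the host for a fixed divisor d >= 1.
static fastdiv_t fastdiv_init(uint32_t d) {
    fastdiv_t fd;
    if (d == 1) {               // special case: q = (0 + n) >> 0 = n
        fd.m = 0; fd.l = 0;
        return fd;
    }
    fd.l = 32 - (uint32_t)__builtin_clz(d - 1);  // ceil(log2(d))
    fd.m = (uint32_t)((((uint64_t)1 << (32 + fd.l)) / d)
                      - ((uint64_t)1 << 32) + 1);
    return fd;
}

// In a shader this would be a mulhi + add + shift, avoiding an integer
// divide in the hot loop.
static uint32_t fastdiv_apply(uint32_t n, fastdiv_t fd) {
    uint64_t hi = ((uint64_t)fd.m * n) >> 32;    // mulhi(m, n)
    return (uint32_t)((hi + n) >> fd.l);
}
```

Even though the multiply+shift is already much cheaper than a hardware divide, the comment above suggests that on coopmat-class hardware the remaining fastdiv cost can still be visible, which is why an addressing variant that assumes filter contiguity (and so needs fewer divisions) could help.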
CC @etasnadi