ggml : implement op fusion, starting with REGLU/GEGLU/SWIGLU #14158
Conversation
I missed that these ops change the shape of the input tensor.
I think it would be better to introduce:
enum ggml_glu_op {
    GGML_GLU_OP_REGLU,
    GGML_GLU_OP_GEGLU,
    GGML_GLU_OP_SWIGLU,
};

// similar to ggml_unary()
GGML_API struct ggml_tensor * ggml_glu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        enum ggml_glu_op      op);

// these simply call ggml_glu()
GGML_API struct ggml_tensor * ggml_reglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a);

GGML_API struct ggml_tensor * ggml_geglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a);

GGML_API struct ggml_tensor * ggml_swiglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a);
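For context, a rough usage sketch (not from the PR; tensor names and shapes are illustrative) assuming the input already holds the concatenated up and gate halves from a single matmul, and that the GLU op halves the first dimension of its input:

// hypothetical feed-forward fragment using the proposed API
static struct ggml_tensor * build_ffn_sketch(
        struct ggml_context * ctx,
        struct ggml_tensor  * ffn_up_gate,  // combined up+gate projection (illustrative)
        struct ggml_tensor  * ffn_down,
        struct ggml_tensor  * cur) {
    cur = ggml_mul_mat(ctx, ffn_up_gate, cur); // [2*n_ff, n_tokens]
    cur = ggml_swiglu(ctx, cur);               // [n_ff,   n_tokens] -- activation of one half times the other, in one op
    cur = ggml_mul_mat(ctx, ffn_down, cur);    // [n_embd, n_tokens]
    return cur;
}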
Hope we don't forget to implement these in the rest of the backends.
Adding @JohannesGaessler for review of the CUDA changes.
Yes, let's add the rest of the backends first before merging. At least Metal and Vulkan.
More generally, I've been thinking that it would be useful to have something like a backend-specific graph optimization step in ggml. That way you could do things like fuse tensors only if the fused tensor is supported by the backend and only if using it makes sense given the tensor shapes.
Any suggestions on who could help with that?
GGML_API struct ggml_tensor * ggml_swiglu(
        struct ggml_context * ctx,
        struct ggml_tensor  * a);
Just want to note that I have been observing a variant of swiglu: it's used by ultravox, which applies the sigmoid to the second half of the vector instead of the first half.
Oh, interesting, worth adding a parameter for, or best just handling in conversion?
https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_3-70b/blob/main/ultravox_model.py#L701-L704
I think it would be nice to have a param since the GGUFs are already on the internet. Haven't thought about permuting the FFN up tensor before, nice suggestion
Added swapped variants.
@ggerganov I didn't dare update the Metal code, so it needs to be implemented there too. :)
@0cc4m @jeffbolznv are either of you interested in a Vulkan implementation?
I can look into it tomorrow.
Yep. :)
Please merge this PR first so that I can adjust the existing kernels for split up and gate. :) I will deduplicate the SYCL code then.
The plan is to merge #14181 into this one once @ggerganov signs off on it, then backends can be updated, and once all tests go green, merge into master.
@qnixsynapse If you want you can bring the other branch up-to-date and add your changes there.
@ggerganov So, I think we are only missing the Metal implementation. At this point I think it makes sense to add those separately, then merge this first, then the other PR separately?
ggml/src/ggml-sycl/element_wise.cpp

template<typename T>
static void geglu_sycl(const T * x, const T * g, T * dst, const uint64_t k, const uint64_t n, const uint64_t o, queue_ptr main_stream) {
Nit: For geglu_sycl, reglu_sycl & swiglu_sycl the code is identical and I think you can have a single function where the gated op is passed as a template parameter.
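For illustration, a rough sketch of that deduplication in plain C++ (not the actual SYCL kernel; gated_op and the activation functors are hypothetical names), with the element-wise op passed as a template parameter so that REGLU, GEGLU and SWIGLU share one body:

#include <cmath>
#include <cstdint>

// hypothetical activation functors (GELU uses the tanh approximation)
struct op_relu { static float apply(float x) { return x > 0.0f ? x : 0.0f; } };
struct op_gelu { static float apply(float x) { return 0.5f*x*(1.0f + std::tanh(0.79788456f*(x + 0.044715f*x*x*x))); } };
struct op_silu { static float apply(float x) { return x / (1.0f + std::exp(-x)); } };

// one body for all three gated ops: dst[i] = act(x[j]) * g[j]
// k: total output elements, n: output row size, o: row stride of the two input halves
template <typename Op>
static void gated_op(const float * x, const float * g, float * dst,
                     const uint64_t k, const uint64_t n, const uint64_t o) {
    for (uint64_t i = 0; i < k; ++i) {
        const uint64_t j = (i / n) * o + (i % n);
        dst[i] = Op::apply(x[j]) * g[j];
    }
}

// usage: gated_op<op_gelu>(x, g, dst, k, n, o);   // GEGLU; op_relu -> REGLU, op_silu -> SWIGLU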
I believe this is exactly what was done in #14181 :)
The change became much bigger than I initially anticipated and I'm having some second thoughts about this. There was a previous discussion about fusing ops: see #5413 (comment). I am wondering if we should directly implement the approach that @slaren proposed there.
That sounds like a very good idea, and it should be possible to refactor this PR without too many changes to the backends. I'll have a look!
Is there a plan for how to handle changes in memory allocation due to fusion? That design feels a bit incomplete. Maybe we should discuss more before you start implementing.
A very rough suggestion: no new "fused" op. Have a "use count" in the tensor that is incremented during build, so that the backend knows a tensor is only used by one op. Add a function for the backend to report that a tensor doesn't need to be written to memory while the graph is being built.
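A minimal sketch of how a backend could use such a use count (the use_count field is hypothetical here and does not exist in ggml at this point): fuse node i into node i+1 only when the intermediate result has exactly one consumer and is not requested as an output, so its buffer never needs to be written out.

// hypothetical check a backend could run while walking the graph
static bool can_fuse_with_next(const struct ggml_cgraph * gf, int i) {
    if (i + 1 >= gf->n_nodes) {
        return false;
    }
    const struct ggml_tensor * cur = gf->nodes[i];
    const struct ggml_tensor * nxt = gf->nodes[i + 1];

    // the next node must actually consume the current result ...
    bool consumed = false;
    for (int s = 0; s < GGML_MAX_SRC; ++s) {
        consumed = consumed || nxt->src[s] == cur;
    }

    // ... and nothing else may need it
    return consumed &&
           cur->use_count == 1 &&                    // hypothetical counter incremented during build
           !(cur->flags & GGML_TENSOR_FLAG_OUTPUT);  // result not requested by the user
}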
Yeah, I can see there are quite a few implementation details that need to be laid out clearly here. First though I think it would make refactoring easier if we merged the other PR into this one, any objections? @ggerganov @0cc4m @qnixsynapse
No, please do.
Why not just have a device-specific optimization run that checks for fusable ops only if the backend supports them? Or would we not be able to cache that "improved" graph for each repetition of the graph run? Then it might be too expensive.
I'm just wondering if fusing ops in this particular case (swiglu/geglu/etc.) can actually improve the performance in a significant way. The main issue with the non-fused version prior to this PR was that we need […]. So just out of curiosity, can we do a test without […]?
You can see in the other PR that, even though it's not as much, we are still gaining 1-3% depending on backend and model. It's not much, but it is something, and it can be a lot more if we go all out and include the mul-mat in the fusion.
Edit: […] Edit 2: @jeffbolznv I think I understand what you meant now: we don't even have to prepare that chain, we can simply keep queuing ops and run the fused op once we have the longest possible chain.
The backend can just look ahead. Ideally it would be nice to have a use list like in LLVM, but I don't think we need that complexity for now. The backend can just look at the next few nodes.
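For illustration, a rough sketch of that look-ahead (backend_supports_fusion and fuse_and_run are hypothetical helpers, not existing ggml functions): grow the candidate chain as long as the backend recognizes the longer pattern, then execute the longest match at once.

// hypothetical loop inside a backend's graph_compute
static int compute_with_lookahead(ggml_backend_t backend, struct ggml_cgraph * gf, int i) {
    int len = 1;
    while (i + len < gf->n_nodes && backend_supports_fusion(backend, gf, i, len + 1)) {
        len++;
    }
    fuse_and_run(backend, gf, i, len); // len == 1 falls back to the normal per-op path
    return i + len;                    // index of the next node to process
}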
BTW, we need a way to disable specific ops from getting fused, f.ex. when running imatrix (llama.cpp/tools/imatrix/imatrix.cpp, lines 83 to 84 in 745aa53).
* implement GLU for split up/gate
* add tests for ggml_glu_split
* Vulkan: Implement glu_split logic and shader support
* add split to logging [no ci]
* SYCL: refactor element_size ops and add split up and gate support to gated kernels
* SYCL: switch GEGLU to use tanh approximation

Co-authored-by: 0cc4m <[email protected]>
Co-authored-by: Akarshan <[email protected]>
I wonder if it would make sense for the backends to announce what fusion they can do so that it can be taken into account before the graph is split, or if just performing the fusion at the backend is sufficient/cost-effective enough?
I think the "fused op" solves all issues with the graph splits and result dependencies. @jeffbolznv What is the reason to prefer not using a dedicated fused operator?
Imagine we start by fusing A+B, and change the llama frontend to generate a fused A+B op. Then all the backends implement that. So far so good. Now somebody wants to fuse A+B+C. Do they change the frontend to generate a fused A+B+C op? Then they need to change all backends to fuse that or partially replay it, to avoid a perf regression everywhere. As we add more and more fusion, I think this gets unmanageable because nobody is really comfortable changing and testing all the backends. And it doesn't do anything to address avoiding memory allocation for intermediate tensors.
I think if we have […]. I think we are mainly targeting fusing in-place operations such as additions and multiplications. These don't require additional memory. It's not ideal in that sense if we decide to do more complex fusing in the future, but apart from requiring extra memory, there aren't technical blockers. The other approach of scanning the graph, meanwhile, is technically very difficult to implement correctly because there are many cases to handle, especially with multi-backend support. So I'm still leaning towards the "fused op" approach, though I will be giving this some further thought.
What happens if one backend wants to fuse ABC (eg mat mul, scale, bias) but another backend wants to fuse BCD (eg scale, bias, activation)? I fear the fused ops will in the long run not match how backends think about these operations (more like an optimizing compiler), and then they all kind of have to unwind them and refuse. Also, if you envision a future where there are lots of ggml applications, it would be a shame if each app has to explicitly opt in to new fusion optimizations by changing their code to adopt the new fused ops. We'll miss out on what could have been free performance if the optimizations happened automatically.
I'm +1 for the idea of scanning the cgraph as @jeffbolznv suggested, but mostly because I think it will provide a better DX overall. A developer already using ggml in their product will be quite confused if we add a dedicated fused op once in a while. Downstream apps may not be able to take advantage of this, as they may not even notice that the op was added. The best case scenario would be to have no change to the public API, while still allowing ops to be fused internally. I think it can be the same idea as some ops automatically using […].
As I think I've said before in this thread, my preferred solution for fused operations would be a backend-specific graph optimization step. This would be applicable to more than just fused operations. For example, data conversions differ between backends and could be moved out of the specific ggml ops into the compute graph. You could then for example re-use the converted data for branching matrix multiplications or move data conversions between backends to minimize the amount of data that needs to be transmitted between them.
The fused ops are a lot more palatable if they're generated in a backend optimization pass. I think it's still worthwhile for the ggml common code to generate a bit more "connectivity" (a use list or use count) for the backend to be able to do this more easily.
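For illustration, one hypothetical shape such a backend optimization pass could take (this is not existing ggml API; the hook name and signature are made up): the scheduler gives each backend a chance to rewrite its split of the graph before compute, so fusion decisions live entirely inside the backend and applications get them without changing their code.

// hypothetical hook: a backend-provided callback invoked by the scheduler on that
// backend's portion of the graph after splitting; the backend may replace sequences
// of nodes it knows how to fuse and drop the intermediate results it no longer needs
typedef void (*ggml_backend_graph_optimize_t)(ggml_backend_t backend, struct ggml_cgraph * cgraph);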
Implement op fusion, starting with REGLU/GEGLU/SWIGLU as a PoC. Implementing these ops avoids unnecessary tensor duplication and gives slightly more efficient execution by combining multiple ops into one.
Only CPU and CUDA right now, help needed to complete other backends!