Offload only activated experts to the GPU #698
base: main
Conversation
For Qwen 235B:
I copied the PR into a branch and rebuilt from scratch (make clean / ccmake / make).
Thanks for the bug report. If you are willing to test, can you do these two things:
Multi-GPU crashes like this:
Single GPU runs:
Thanks for testing! So, the issue is with multi-GPU, which I cannot test myself. Btw, I did test Qwen3-30B-A3B and observed zero effect; I'm just in the process of testing with … So, it looks like having rarely activated experts is specific to GPT-OSS. Although I can also remember people having trouble getting imatrix data for the large DeepSeek models, which indicates that at least some experts are rarely activated, so possibly one may gain something there.
I thought some experts in Qwen didn't get used as well: https://huggingface.co/kalomaze/Qwen3-16B-A3B Maybe it just doesn't help here. Not a big fan of GPT-OSS; I tried it on Hugging Face chat and decided to skip it when it came out.
From the discussion in that repo, it lost basically all Chinese capability.
It was a single expert on a single layer, I think. It was Arctic where less diverse datasets activated fewer experts, but even with relatively diverse ones there were still a few missing, I think. The performance gain was still decreasing with increasing n_ubatch for …
Yep, but if you're not using Chinese it should avoid those experts and never activate them. It seems it didn't work out that way in this PR, though.
This PR is based on PR 15346 in mainline `llama.cpp`. Due to the divergence of the code bases, cherry-picking did not work. I also needed to implement the logic for the fused `ffn_up+ffn_gate` op, which is not present in mainline, so it isn't just a copy/paste.

For hybrid CPU/GPU inference with MoE models, when experts are stored in RAM only the activated experts are copied to the GPU. The change only affects prompt processing (experts stored in RAM are never copied to the GPU for TG), and only activates if the batch size is large enough to trigger GPU offload (which, unlike mainline, where experts are copied for batch sizes >= 32, is model dependent and given by `32 * total_experts / active_experts`). The idea is that if a significant fraction of the experts are rarely activated, this could result in a non-negligible performance gain. However, for many models and a batch size large enough to trigger GPU offload, basically all experts are active, so there is no performance gain (and one may even observe a slight performance degradation, as the added steps to synchronize with the GPU(s) and copy over the expert IDs do not pay off in reducing the amount of data being copied).

Here are some performance comparison examples. Most benchmarks are run on a Ryzen-7950X CPU + RTX-4080 GPU using `-ot exps=CPU` (so all experts are on the CPU). The GPT-OSS-120B case is run on a Ryzen-5975WX + RTX-4080 system (the Ryzen-7950X rig does not have enough RAM for GPT-OSS-120B). In all cases flash attention is on and `fmoe = 1`. Only u-batch sizes large enough to trigger GPU offload are included in the tables.

GPT-OSS-20B, MXFP4
GPT-OSS-120B, MXFP4
DeepSeek-Lite, Q4_0, mla = 3
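To make the offload condition described above concrete, here is a minimal C++ sketch of the batch-size threshold 32 * total_experts / active_experts. This is not the actual ik_llama.cpp code; the helper name is hypothetical, and the expert counts used in main() are illustrative assumptions rather than verified model configurations.

```cpp
// Minimal sketch of the model-dependent offload threshold from the PR
// description: experts stored in RAM are only copied to the GPU when the
// u-batch has at least 32 * total_experts / active_experts tokens.
#include <cstdio>
#include <cstdint>

// Hypothetical helper; in the real code the decision lives in the backend.
static bool should_offload_experts(int64_t n_tokens, int64_t n_expert, int64_t n_expert_used) {
    // Mainline llama.cpp uses a fixed 32-token threshold; this PR scales it
    // by the expert ratio so the copy only happens when enough experts are
    // likely to be activated to make the transfer worthwhile.
    const int64_t min_batch = 32 * n_expert / n_expert_used;
    return n_tokens >= min_batch;
}

int main() {
    // Assumed expert counts for illustration only:
    // 128 experts, 4 active  -> threshold 1024 tokens
    // 64 experts, 6 active   -> threshold 341 tokens (integer division)
    printf("128/4 experts, u-batch 512:  offload = %d\n", should_offload_experts(512,  128, 4));
    printf("128/4 experts, u-batch 2048: offload = %d\n", should_offload_experts(2048, 128, 4));
    printf("64/6 experts,  u-batch 512:  offload = %d\n", should_offload_experts(512,  64, 6));
    return 0;
}
```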