sched : copy only the used experts when offloading prompt processing #15346
base: master
Conversation
With models where some experts are often unused, such as gpt-oss
Interesting - I would have expected that a batch of 2048 tokens would pretty much always activate all experts.
```cpp
const size_t padding = 512;
const size_t padding_end = last_id < input->ne[2] - 1 ? std::min<size_t>(expert_size, padding) : 0;

ggml_backend_tensor_set_async(split_backend,
    input_cpy,
    (const uint8_t *)input->data + expert_offset, expert_offset,
    // copy a bit extra to ensure there are no NaNs in the padding
    expert_size_copy + padding_end);
```
I'm not sure I understand the padding logic and why it is needed.
The padding is necessary for MMQ. Normally the CUDA backend adds the padding at the end of the tensor, but with MoE, the padding of each expert is effectively the beginning of the next expert. We need to ensure that the padding after every expert does not contain NaNs, and to do that we copy a few bytes from the next expert. It would also be possible to use ggml_backend_tensor_memset, but this way we avoid an extra call, at the cost of increasing the transfer size by a negligible amount.
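For illustration, here is a minimal sketch of that memset alternative, reusing the variable names from the diff above (split_backend, input_cpy, expert_offset, expert_size_copy, last_id, expert_size). This is not what the PR does, just the trade-off being described; synchronization between the async copy and the memset is omitted.

```cpp
// Sketch of the memset alternative (assumption, not the actual patch):
// copy exactly the used experts, then zero the MMQ padding region that
// follows them, instead of over-copying a few bytes from the next expert.
ggml_backend_tensor_set_async(split_backend,
    input_cpy,
    (const uint8_t *)input->data + expert_offset, expert_offset,
    expert_size_copy);

if (last_id < input->ne[2] - 1) {
    const size_t padding = 512; // same padding size as in the diff
    // clear the bytes right after the copied range so MMQ never reads NaNs
    ggml_backend_tensor_memset(input_cpy, 0,
        expert_offset + expert_size_copy,
        std::min<size_t>(expert_size, padding));
}
```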
I was surprised too; I was only expecting to improve small batch sizes. This is how it looks when processing some code:

BS 512: blk.0.ffn_gate_exps.weight: 95/128 experts used
BS 2048: blk.0.ffn_up_exps.weight: 128/128 experts used
Hello, I was wondering what the behavior would be on a multi-GPU system, with some experts offloaded to other GPUs and some to the CPU? i.e., I have this complex offload for DeepSeek.
Would copying only the used experts also apply to the GPUs that hold some of the other experts?
This only affects experts stored on the CPU.
When offloading prompt processing to the GPU with MoE weights on the CPU, all the experts are copied to VRAM. With this change, only the experts that are actually used are copied. With models where some experts are often unused, such as gpt-oss, this can significantly reduce the amount of data that needs to be copied to VRAM.
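As an illustration of the idea (not the actual patch), the experts referenced by a batch can be derived from the ids tensor that a GGML_OP_MUL_MAT_ID node uses for routing. A sketch, where the function name and data layout are assumptions:

```cpp
#include <cstdint>
#include <vector>

// Sketch (assumption, not the actual implementation): mark which experts are
// referenced by the router ids of a GGML_OP_MUL_MAT_ID node. Only the marked
// slices of the expert weight tensor would then need to be uploaded to VRAM.
static std::vector<bool> experts_used(const int32_t * ids, size_t n_ids, int64_t n_expert) {
    std::vector<bool> used(n_expert, false);
    for (size_t i = 0; i < n_ids; ++i) {
        if (ids[i] >= 0 && ids[i] < n_expert) {
            used[ids[i]] = true;
        }
    }
    return used;
}

// Consecutive runs of used experts map naturally to one copy each:
// offset = first_id * expert_size, size = (last_id - first_id + 1) * expert_size,
// which is where the expert_offset / expert_size_copy values in the diff come from.
```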
GGML_CUDA=1 scripts/compare-commits.sh master sl/sched-used-experts llama-bench -m gpt-oss-20b-mxfp4.gguf -ot "exps=CPU" -fa 1 -n 0 -p 2048 -ub "128-2048*2"
GGML_CUDA=1 scripts/compare-commits.sh master sl/sched-used-experts llama-bench -m gpt-oss-120b-mxfp4-00001-of-00003.gguf -ot "exps=CPU" -fa 1 -n 0 -p 2048 -ub "128-2048*2" -r 1
Short tests with real data and respecting the chat template (using `llama-cli -f <prompt.txt>`) also show a significant improvement.