sched : copy only the used experts when offloading prompt processing #15346

Open: slaren wants to merge 1 commit into master from sl/sched-used-experts

Conversation

slaren (Member) commented on Aug 15, 2025

When offloading prompt processing to the GPU with MoE weights on the CPU, all the experts are copied to VRAM. With this change, only the experts that are actually used are copied. With models where some experts are often unused, such as gpt-oss, this can significantly reduce the amount of data that needs to be copied to VRAM.
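
To illustrate the idea (this is a sketch, not the scheduler code from this PR), the helper below collapses the set of expert IDs referenced by a batch into contiguous byte ranges of the expert weight tensor, so that only those ranges would need to be uploaded. The function name and the flat `std::vector<int32_t>` of IDs are assumptions for the example; the real change operates on the ggml graph and issues the copies with `ggml_backend_tensor_set_async`.

```cpp
// Sketch: merge the expert IDs used by a batch into contiguous
// {offset, size} byte ranges of the expert weight tensor, assuming the
// experts are laid out back to back along ne[2] and each occupies
// `expert_size` bytes.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

static std::vector<std::pair<size_t, size_t>>
used_expert_ranges(std::vector<int32_t> ids, size_t expert_size) {
    std::sort(ids.begin(), ids.end());
    ids.erase(std::unique(ids.begin(), ids.end()), ids.end());

    std::vector<std::pair<size_t, size_t>> ranges;
    for (size_t i = 0; i < ids.size(); ) {
        size_t j = i + 1;
        // extend the run while the expert IDs are consecutive
        while (j < ids.size() && ids[j] == ids[j - 1] + 1) {
            j++;
        }
        ranges.emplace_back((size_t) ids[i] * expert_size, (j - i) * expert_size);
        i = j;
    }
    return ranges;
}
```

Each resulting range can then be uploaded with a single `ggml_backend_tensor_set_async` call at the corresponding offset instead of transferring all `ne[2]` experts; the review discussion below covers the extra padding the actual copy needs for MMQ.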

GGML_CUDA=1 scripts/compare-commits.sh master sl/sched-used-experts llama-bench -m gpt-oss-20b-mxfp4.gguf -ot "exps=CPU" -fa 1 -n 0 -p 2048 -ub "128-2048*2"

| Model | Microbatch size | Test | t/s (master) | t/s (sl/sched-used-experts) | Speedup |
|---|---|---|---|---|---|
| gpt-oss 20B MXFP4 MoE | 128 | pp2048 | 124.24 | 202.92 | 1.63 |
| gpt-oss 20B MXFP4 MoE | 256 | pp2048 | 253.18 | 372.75 | 1.47 |
| gpt-oss 20B MXFP4 MoE | 512 | pp2048 | 502.73 | 685.31 | 1.36 |
| gpt-oss 20B MXFP4 MoE | 1024 | pp2048 | 1015.20 | 1178.50 | 1.16 |
| gpt-oss 20B MXFP4 MoE | 2048 | pp2048 | 1629.92 | 2089.53 | 1.28 |

GGML_CUDA=1 scripts/compare-commits.sh master sl/sched-used-experts llama-bench -m gpt-oss-120b-mxfp4-00001-of-00003.gguf -ot "exps=CPU" -fa 1 -n 0 -p 2048 -ub "128-2048*2" -r 1

| Model | Microbatch size | Test | t/s (master) | t/s (sl/sched-used-experts) | Speedup |
|---|---|---|---|---|---|
| gpt-oss 120B MXFP4 MoE | 128 | pp2048 | 22.37 | 59.09 | 2.64 |
| gpt-oss 120B MXFP4 MoE | 256 | pp2048 | 41.80 | 99.83 | 2.39 |
| gpt-oss 120B MXFP4 MoE | 512 | pp2048 | 87.93 | 182.89 | 2.08 |
| gpt-oss 120B MXFP4 MoE | 1024 | pp2048 | 168.17 | 255.84 | 1.52 |
| gpt-oss 120B MXFP4 MoE | 2048 | pp2048 | 345.01 | 483.99 | 1.40 |

Short tests with real data and respecting the chat template (using llama-cli -f <prompt.txt>) also show a significant improvement.

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Aug 15, 2025
slaren marked this pull request as ready for review on August 15, 2025 at 14:37
ggerganov (Member) left a comment:

> With models where some experts are often unused, such as gpt-oss

Interesting - I would have expected that a batch of 2048 tokens would pretty much always activate all experts.

Review comment on lines +1418 to +1425:

```cpp
const size_t padding     = 512;
const size_t padding_end = last_id < input->ne[2] - 1 ? std::min<size_t>(expert_size, padding) : 0;

ggml_backend_tensor_set_async(split_backend,
    input_cpy,
    (const uint8_t *)input->data + expert_offset, expert_offset,
    // copy a bit extra to ensure there are no NaNs in the padding
    expert_size_copy + padding_end);
```
ggerganov (Member) commented on the snippet:

I'm not sure I understand the padding logic and why it is needed.

slaren (Member, Author) replied:

The padding is necessary for MMQ. Normally the CUDA backend adds the padding at the end of the tensor, but with MoE, the padding of each expert is effectively the beginning of the next expert. We need to ensure that the padding after every expert does not contain NaNs, and to do that we copy a few extra bytes from the next expert. It would also be possible to use ggml_backend_tensor_memset, but this way we avoid an extra call, at the cost of increasing the transfer size by a negligible amount.
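
For comparison, here is a minimal sketch of the `ggml_backend_tensor_memset` alternative mentioned above, which clears the padding region explicitly instead of over-copying from the next expert. The variable names mirror the review snippet and are assumed to be in scope, and the `ggml_backend_tensor_memset(tensor, value, offset, size)` signature is an assumption for the example:

```cpp
// Sketch of the alternative: copy exactly the expert range, then zero the
// trailing MMQ padding instead of over-copying bytes from the next expert.
// split_backend, input, input_cpy, expert_offset, expert_size,
// expert_size_copy and last_id mirror the review snippet above.
const size_t padding     = 512;
const size_t padding_end = last_id < input->ne[2] - 1 ? std::min<size_t>(expert_size, padding) : 0;

// upload only the expert data itself
ggml_backend_tensor_set_async(split_backend,
    input_cpy,
    (const uint8_t *)input->data + expert_offset, expert_offset,
    expert_size_copy);

// clear the padding after the copied range so MMQ does not read NaNs
if (padding_end > 0) {
    ggml_backend_tensor_memset(input_cpy, 0, expert_offset + expert_size_copy, padding_end);
}
```

The over-copy used in the PR avoids this second backend call, at the cost of transferring at most `padding` extra bytes per copied range.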

slaren (Member, Author) commented on Aug 16, 2025

> Interesting - I would have expected that a batch of 2048 tokens would pretty much always activate all experts.

I was surprised too; I was only expecting to improve small batch sizes. This is how it looks when processing some code with llama-cli (respecting the chat template); a sketch of how such per-tensor counts can be collected follows the logs below:

BS 512

blk.0.ffn_gate_exps.weight: 95/128 experts used
blk.0.ffn_up_exps.weight: 95/128 experts used
blk.0.ffn_down_exps.weight: 95/128 experts used
blk.1.ffn_gate_exps.weight: 91/128 experts used
blk.1.ffn_up_exps.weight: 91/128 experts used
blk.1.ffn_down_exps.weight: 91/128 experts used
blk.2.ffn_gate_exps.weight: 72/128 experts used
blk.2.ffn_up_exps.weight: 72/128 experts used
blk.2.ffn_down_exps.weight: 72/128 experts used
blk.3.ffn_gate_exps.weight: 74/128 experts used
blk.3.ffn_up_exps.weight: 74/128 experts used
blk.3.ffn_down_exps.weight: 74/128 experts used
blk.4.ffn_gate_exps.weight: 73/128 experts used
blk.4.ffn_up_exps.weight: 73/128 experts used
blk.4.ffn_down_exps.weight: 73/128 experts used
blk.5.ffn_gate_exps.weight: 78/128 experts used
blk.5.ffn_up_exps.weight: 78/128 experts used
blk.5.ffn_down_exps.weight: 78/128 experts used
blk.6.ffn_gate_exps.weight: 87/128 experts used
blk.6.ffn_up_exps.weight: 87/128 experts used
blk.6.ffn_down_exps.weight: 87/128 experts used
blk.7.ffn_gate_exps.weight: 95/128 experts used
blk.7.ffn_up_exps.weight: 95/128 experts used
blk.7.ffn_down_exps.weight: 95/128 experts used
blk.8.ffn_gate_exps.weight: 78/128 experts used
blk.8.ffn_up_exps.weight: 78/128 experts used
blk.8.ffn_down_exps.weight: 78/128 experts used
blk.9.ffn_gate_exps.weight: 81/128 experts used
blk.9.ffn_up_exps.weight: 81/128 experts used
blk.9.ffn_down_exps.weight: 81/128 experts used
blk.10.ffn_gate_exps.weight: 70/128 experts used
blk.10.ffn_up_exps.weight: 70/128 experts used
blk.10.ffn_down_exps.weight: 70/128 experts used
blk.11.ffn_gate_exps.weight: 73/128 experts used
blk.11.ffn_up_exps.weight: 73/128 experts used
blk.11.ffn_down_exps.weight: 73/128 experts used
blk.12.ffn_gate_exps.weight: 80/128 experts used
blk.12.ffn_up_exps.weight: 80/128 experts used
blk.12.ffn_down_exps.weight: 80/128 experts used
blk.13.ffn_gate_exps.weight: 77/128 experts used
blk.13.ffn_up_exps.weight: 77/128 experts used
blk.13.ffn_down_exps.weight: 77/128 experts used
blk.14.ffn_gate_exps.weight: 87/128 experts used
blk.14.ffn_up_exps.weight: 87/128 experts used
blk.14.ffn_down_exps.weight: 87/128 experts used
blk.15.ffn_gate_exps.weight: 80/128 experts used
blk.15.ffn_up_exps.weight: 80/128 experts used
blk.15.ffn_down_exps.weight: 80/128 experts used
blk.16.ffn_gate_exps.weight: 70/128 experts used
blk.16.ffn_up_exps.weight: 70/128 experts used
blk.16.ffn_down_exps.weight: 70/128 experts used
blk.17.ffn_gate_exps.weight: 80/128 experts used
blk.17.ffn_up_exps.weight: 80/128 experts used
blk.17.ffn_down_exps.weight: 80/128 experts used
blk.18.ffn_gate_exps.weight: 70/128 experts used
blk.18.ffn_up_exps.weight: 70/128 experts used
blk.18.ffn_down_exps.weight: 70/128 experts used
blk.19.ffn_gate_exps.weight: 77/128 experts used
blk.19.ffn_up_exps.weight: 77/128 experts used
blk.19.ffn_down_exps.weight: 77/128 experts used
blk.20.ffn_gate_exps.weight: 72/128 experts used
blk.20.ffn_up_exps.weight: 72/128 experts used
blk.20.ffn_down_exps.weight: 72/128 experts used
blk.21.ffn_gate_exps.weight: 78/128 experts used
blk.21.ffn_up_exps.weight: 78/128 experts used
blk.21.ffn_down_exps.weight: 78/128 experts used
blk.22.ffn_gate_exps.weight: 60/128 experts used
blk.22.ffn_up_exps.weight: 60/128 experts used
blk.22.ffn_down_exps.weight: 60/128 experts used
blk.23.ffn_gate_exps.weight: 69/128 experts used
blk.23.ffn_up_exps.weight: 69/128 experts used
blk.23.ffn_down_exps.weight: 69/128 experts used
blk.24.ffn_gate_exps.weight: 70/128 experts used
blk.24.ffn_up_exps.weight: 70/128 experts used
blk.24.ffn_down_exps.weight: 70/128 experts used
blk.25.ffn_gate_exps.weight: 70/128 experts used
blk.25.ffn_up_exps.weight: 70/128 experts used
blk.25.ffn_down_exps.weight: 70/128 experts used
blk.26.ffn_gate_exps.weight: 79/128 experts used
blk.26.ffn_up_exps.weight: 79/128 experts used
blk.26.ffn_down_exps.weight: 79/128 experts used
blk.27.ffn_gate_exps.weight: 74/128 experts used
blk.27.ffn_up_exps.weight: 74/128 experts used
blk.27.ffn_down_exps.weight: 74/128 experts used
blk.28.ffn_gate_exps.weight: 74/128 experts used
blk.28.ffn_up_exps.weight: 74/128 experts used
blk.28.ffn_down_exps.weight: 74/128 experts used
blk.29.ffn_gate_exps.weight: 76/128 experts used
blk.29.ffn_up_exps.weight: 76/128 experts used
blk.29.ffn_down_exps.weight: 76/128 experts used
blk.30.ffn_gate_exps.weight: 68/128 experts used
blk.30.ffn_up_exps.weight: 68/128 experts used
blk.30.ffn_down_exps.weight: 68/128 experts used
blk.31.ffn_gate_exps.weight: 81/128 experts used
blk.31.ffn_up_exps.weight: 81/128 experts used
blk.31.ffn_down_exps.weight: 81/128 experts used
blk.32.ffn_gate_exps.weight: 64/128 experts used
blk.32.ffn_up_exps.weight: 64/128 experts used
blk.32.ffn_down_exps.weight: 64/128 experts used
blk.33.ffn_gate_exps.weight: 59/128 experts used
blk.33.ffn_up_exps.weight: 59/128 experts used
blk.33.ffn_down_exps.weight: 59/128 experts used
blk.34.ffn_gate_exps.weight: 50/128 experts used
blk.34.ffn_up_exps.weight: 50/128 experts used
blk.34.ffn_down_exps.weight: 50/128 experts used

BS 2048

blk.0.ffn_up_exps.weight: 128/128 experts used
blk.0.ffn_down_exps.weight: 128/128 experts used
blk.1.ffn_gate_exps.weight: 121/128 experts used
blk.1.ffn_up_exps.weight: 121/128 experts used
blk.1.ffn_down_exps.weight: 121/128 experts used
blk.2.ffn_gate_exps.weight: 116/128 experts used
blk.2.ffn_up_exps.weight: 116/128 experts used
blk.2.ffn_down_exps.weight: 116/128 experts used
blk.3.ffn_gate_exps.weight: 111/128 experts used
blk.3.ffn_up_exps.weight: 111/128 experts used
blk.3.ffn_down_exps.weight: 111/128 experts used
blk.4.ffn_gate_exps.weight: 115/128 experts used
blk.4.ffn_up_exps.weight: 115/128 experts used
blk.4.ffn_down_exps.weight: 115/128 experts used
blk.5.ffn_gate_exps.weight: 120/128 experts used
blk.5.ffn_up_exps.weight: 120/128 experts used
blk.5.ffn_down_exps.weight: 120/128 experts used
blk.6.ffn_gate_exps.weight: 121/128 experts used
blk.6.ffn_up_exps.weight: 121/128 experts used
blk.6.ffn_down_exps.weight: 121/128 experts used
blk.7.ffn_gate_exps.weight: 120/128 experts used
blk.7.ffn_up_exps.weight: 120/128 experts used
blk.7.ffn_down_exps.weight: 120/128 experts used
blk.8.ffn_gate_exps.weight: 119/128 experts used
blk.8.ffn_up_exps.weight: 119/128 experts used
blk.8.ffn_down_exps.weight: 119/128 experts used
blk.9.ffn_gate_exps.weight: 112/128 experts used
blk.9.ffn_up_exps.weight: 112/128 experts used
blk.9.ffn_down_exps.weight: 112/128 experts used
blk.10.ffn_gate_exps.weight: 108/128 experts used
blk.10.ffn_up_exps.weight: 108/128 experts used
blk.10.ffn_down_exps.weight: 108/128 experts used
blk.11.ffn_gate_exps.weight: 102/128 experts used
blk.11.ffn_up_exps.weight: 102/128 experts used
blk.11.ffn_down_exps.weight: 102/128 experts used
blk.12.ffn_gate_exps.weight: 112/128 experts used
blk.12.ffn_up_exps.weight: 112/128 experts used
blk.12.ffn_down_exps.weight: 112/128 experts used
blk.13.ffn_gate_exps.weight: 109/128 experts used
blk.13.ffn_up_exps.weight: 109/128 experts used
blk.13.ffn_down_exps.weight: 109/128 experts used
blk.14.ffn_gate_exps.weight: 118/128 experts used
blk.14.ffn_up_exps.weight: 118/128 experts used
blk.14.ffn_down_exps.weight: 118/128 experts used
blk.15.ffn_gate_exps.weight: 118/128 experts used
blk.15.ffn_up_exps.weight: 118/128 experts used
blk.15.ffn_down_exps.weight: 118/128 experts used
blk.16.ffn_gate_exps.weight: 99/128 experts used
blk.16.ffn_up_exps.weight: 99/128 experts used
blk.16.ffn_down_exps.weight: 99/128 experts used
blk.17.ffn_gate_exps.weight: 105/128 experts used
blk.17.ffn_up_exps.weight: 105/128 experts used
blk.17.ffn_down_exps.weight: 105/128 experts used
blk.18.ffn_gate_exps.weight: 100/128 experts used
blk.18.ffn_up_exps.weight: 100/128 experts used
blk.18.ffn_down_exps.weight: 100/128 experts used
blk.19.ffn_gate_exps.weight: 103/128 experts used
blk.19.ffn_up_exps.weight: 103/128 experts used
blk.19.ffn_down_exps.weight: 103/128 experts used
blk.20.ffn_gate_exps.weight: 99/128 experts used
blk.20.ffn_up_exps.weight: 99/128 experts used
blk.20.ffn_down_exps.weight: 99/128 experts used
blk.21.ffn_gate_exps.weight: 98/128 experts used
blk.21.ffn_up_exps.weight: 98/128 experts used
blk.21.ffn_down_exps.weight: 98/128 experts used
blk.22.ffn_gate_exps.weight: 97/128 experts used
blk.22.ffn_up_exps.weight: 97/128 experts used
blk.22.ffn_down_exps.weight: 97/128 experts used
blk.23.ffn_gate_exps.weight: 88/128 experts used
blk.23.ffn_up_exps.weight: 88/128 experts used
blk.23.ffn_down_exps.weight: 88/128 experts used
blk.24.ffn_gate_exps.weight: 93/128 experts used
blk.24.ffn_up_exps.weight: 93/128 experts used
blk.24.ffn_down_exps.weight: 93/128 experts used
blk.25.ffn_gate_exps.weight: 95/128 experts used
blk.25.ffn_up_exps.weight: 95/128 experts used
blk.25.ffn_down_exps.weight: 95/128 experts used
blk.26.ffn_gate_exps.weight: 99/128 experts used
blk.26.ffn_up_exps.weight: 99/128 experts used
blk.26.ffn_down_exps.weight: 99/128 experts used
blk.27.ffn_gate_exps.weight: 92/128 experts used
blk.27.ffn_up_exps.weight: 92/128 experts used
blk.27.ffn_down_exps.weight: 92/128 experts used
blk.28.ffn_gate_exps.weight: 98/128 experts used
blk.28.ffn_up_exps.weight: 98/128 experts used
blk.28.ffn_down_exps.weight: 98/128 experts used
blk.29.ffn_gate_exps.weight: 101/128 experts used
blk.29.ffn_up_exps.weight: 101/128 experts used
blk.29.ffn_down_exps.weight: 101/128 experts used
blk.30.ffn_gate_exps.weight: 92/128 experts used
blk.30.ffn_up_exps.weight: 92/128 experts used
blk.30.ffn_down_exps.weight: 92/128 experts used
blk.31.ffn_gate_exps.weight: 104/128 experts used
blk.31.ffn_up_exps.weight: 104/128 experts used
blk.31.ffn_down_exps.weight: 104/128 experts used
blk.32.ffn_gate_exps.weight: 99/128 experts used
blk.32.ffn_up_exps.weight: 99/128 experts used
blk.32.ffn_down_exps.weight: 99/128 experts used
blk.33.ffn_gate_exps.weight: 93/128 experts used
blk.33.ffn_up_exps.weight: 93/128 experts used
blk.33.ffn_down_exps.weight: 93/128 experts used
blk.34.ffn_gate_exps.weight: 84/128 experts used
blk.34.ffn_up_exps.weight: 84/128 experts used
blk.34.ffn_down_exps.weight: 84/128 experts used
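
For reference, per-tensor counts like these can be gathered with a small debug helper along the following lines: for each `GGML_OP_MUL_MAT_ID` node, read back its I32 selection tensor (`src[2]`, roughly shaped [n_expert_used, n_tokens]) and count the distinct expert indices. This is an illustration of the counting, not the instrumentation actually used for the logs above:

```cpp
#include <cinttypes>
#include <cstdio>
#include <set>
#include <vector>

#include "ggml.h"
#include "ggml-backend.h"

// Print how many distinct experts a GGML_OP_MUL_MAT_ID node selects.
// `weights` is the expert weight tensor (src[0] of the node); weights->ne[2]
// is the total number of experts. `ids` is the I32 selection tensor (src[2]).
static void print_expert_usage(const struct ggml_tensor * weights,
                               const struct ggml_tensor * ids) {
    const int64_t n_ids = ggml_nelements(ids);

    std::vector<int32_t> buf(n_ids);
    // copies the selection data to host memory, regardless of the backend
    ggml_backend_tensor_get(ids, buf.data(), 0, n_ids * sizeof(int32_t));

    const std::set<int32_t> used(buf.begin(), buf.end());

    printf("%s: %zu/%" PRId64 " experts used\n",
           weights->name, used.size(), weights->ne[2]);
}
```

Hooking something like this into a graph callback for every `ffn_*_exps` tensor produces per-layer counts of the kind shown above.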

Panchovix commented:

Hello, I was wondering what the behavior would be on a multi-GPU system where some experts are offloaded to other GPUs and some to the CPU.

For example, I have this complex offload setup for DeepSeek:

```sh
./llama-server -m '/models_llm/DeepSeek-R1-0528-IQ4_XS-merged.gguf' -c 32768 --no-mmap -ngl 999 \
  -ot "blk.(0|1|2|3|4|5|6).ffn.=CUDA0" \
  -ot "blk.(7|8|9).ffn.=CUDA1" \
  -ot "blk.(10|11|12).ffn.=CUDA2" \
  -ot "blk.(13|14|15|16).ffn.=CUDA3" \
  -ot "blk.(17|18|19).ffn.=CUDA4" \
  -ot "blk.(20|21|22).ffn.=CUDA5" \
  -ot "blk.(23|24|25|26|27|28|29).ffn.=CUDA6" \
  -ot exps=CPU \
  -fa -mg 0 -ub 1024
```

Would copying only the used experts also apply to the GPUs that hold some of the other experts?
(Hope I explained myself here, English is not my first language.)

slaren (Member, Author) commented on Aug 16, 2025

This only affects experts stored in CPU memory.
