sched : copy only the used experts when offloading prompt processing #15346
base: master
Conversation
With models where some experts are often unused, such as gpt-oss
Interesting - I would have expected that a batch of 2048 tokens would pretty much always activate all experts.
```cpp
const size_t padding = 512;
const size_t padding_end = last_id < input->ne[2] - 1 ? std::min<size_t>(expert_size, padding) : 0;

ggml_backend_tensor_set_async(split_backend,
    input_cpy,
    (const uint8_t *)input->data + expert_offset, expert_offset,
    // copy a bit extra to ensure there are no NaNs in the padding
    expert_size_copy + padding_end);
```
I'm not sure I understand the padding logic and why it is needed.
The padding is necessary for MMQ. Normally the CUDA backend adds the padding at the end of the tensor, but with MoE, the padding of each expert is effectively the beginning of the next expert. We need to ensure that the padding after every expert does not contain NaNs, and to do that we copy a few bytes from the next expert. It would also be possible to use ggml_backend_tensor_memset, but this way we avoid an extra call, at the cost of increasing the transfer size by a negligible amount.
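For illustration, here is a minimal sketch of that memset alternative, reusing the variable names from the diff above (split_backend, input_cpy, expert_offset, expert_size_copy, last_id, expert_size). This is not what the PR does, just the trade-off being described; synchronization between the async copy and the memset is omitted.

```cpp
// Sketch of the memset alternative (assumption, not the actual patch):
// copy exactly the used experts, then zero the MMQ padding region that
// follows them, instead of over-copying a few bytes from the next expert.
ggml_backend_tensor_set_async(split_backend,
    input_cpy,
    (const uint8_t *)input->data + expert_offset, expert_offset,
    expert_size_copy);

if (last_id < input->ne[2] - 1) {
    const size_t padding = 512; // same padding size as in the diff
    // clear the bytes right after the copied range so MMQ never reads NaNs
    ggml_backend_tensor_memset(input_cpy, 0,
        expert_offset + expert_size_copy,
        std::min<size_t>(expert_size, padding));
}
```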
I was surprised too; I was only expecting to improve small batch sizes. This is how it looks when processing some code:

BS 512: blk.0.ffn_gate_exps.weight: 95/128 experts used
BS 2048: blk.0.ffn_up_exps.weight: 128/128 experts used
Hello, I was wondering what the behavior would be on a multi-GPU system, with some experts offloaded to other GPUs and some to the CPU? i.e., I have this complex offload for DeepSeek.
Would copying only the used experts also apply to the GPUs that hold some of the other experts?
This only affects experts stored on the CPU.
When offloading prompt processing to the GPU with MoE weights on the CPU, all the experts are copied to VRAM. With this change, only the experts that are actually used are copied. With models where some experts are often unused, such as gpt-oss, this can significantly reduce the amount of data that needs to be copied to VRAM.
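As an illustration of the idea (not the actual patch), the experts referenced by a batch can be derived from the ids tensor that a GGML_OP_MUL_MAT_ID node uses for routing. A sketch, where the function name and data layout are assumptions:

```cpp
#include <cstdint>
#include <vector>

// Sketch (assumption, not the actual implementation): mark which experts are
// referenced by the router ids of a GGML_OP_MUL_MAT_ID node. Only the marked
// slices of the expert weight tensor would then need to be uploaded to VRAM.
static std::vector<bool> experts_used(const int32_t * ids, size_t n_ids, int64_t n_expert) {
    std::vector<bool> used(n_expert, false);
    for (size_t i = 0; i < n_ids; ++i) {
        if (ids[i] >= 0 && ids[i] < n_expert) {
            used[ids[i]] = true;
        }
    }
    return used;
}

// Consecutive runs of used experts map naturally to one copy each:
// offset = first_id * expert_size, size = (last_id - first_id + 1) * expert_size,
// which is where the expert_offset / expert_size_copy values in the diff come from.
```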
GGML_CUDA=1 scripts/compare-commits.sh master sl/sched-used-experts llama-bench -m gpt-oss-20b-mxfp4.gguf -ot "exps=CPU" -fa 1 -n 0 -p 2048 -ub "128-2048*2"
GGML_CUDA=1 scripts/compare-commits.sh master sl/sched-used-experts llama-bench -m gpt-oss-120b-mxfp4-00001-of-00003.gguf -ot "exps=CPU" -fa 1 -n 0 -p 2048 -ub "128-2048*2" -r 1
Short tests with real data and respecting the chat template (using `llama-cli -f <prompt.txt>`) also show a significant improvement.