
Refactor batch size handling for offloading host operations to device #1214

Closed

DocShotgun wants to merge 2 commits into ikawrakow:main from DocShotgun:ggml-op-offload-min-batch

Conversation

@DocShotgun

This is an attempt to make the handling of the minimum batch size for offloading host operations to device consistent between backends, and to incorporate my changes from ggml-org/llama.cpp#18535 to allow the user to manually override the threshold.

Current Behavior:

In llama.cpp, the default threshold for offloading prompt processing to GPU is 32. After ggml-org/llama.cpp#18535, we can override this value with the env var GGML_OP_OFFLOAD_MIN_BATCH.

In ik_llama.cpp, the default threshold is a heuristic defined by 32 * total_experts / active_experts; however, this is only implemented for the CUDA backend. The other backends (SYCL, Vulkan, CANN) retain the previous llama.cpp behavior of a hardcoded 32. Additionally, the compile-time env var GGML_CUDA_MIN_BATCH_OFFLOAD can be used to change the value of 32 used in the heuristic formula. ik_llama.cpp does not appear to have the equivalent of the Metal op offload that mainline llama.cpp has.
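A minimal sketch of that heuristic (the function and parameter names here are illustrative, not the actual identifiers in ggml/src/ggml-cuda.cu):

```cpp
// Sketch of the CUDA-backend MoE offload threshold heuristic.
// Names are illustrative only; see ggml/src/ggml-cuda.cu for the real code.
static int moe_min_batch_threshold(int base_threshold,   // default 32
                                   int total_experts,
                                   int active_experts) {
    if (active_experts > 0) {
        // MoE scaling: raise the threshold when only a fraction of experts is active
        return base_threshold * total_experts / active_experts;
    }
    return base_threshold;  // dense models keep the plain threshold
}
```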

PR Changes:

  • Extended the MoE min batch size heuristic used by the CUDA backend to the other backends in ik_llama.cpp (SYCL, Vulkan, CANN)
  • Renamed the compile-time env var GGML_CUDA_MIN_BATCH_OFFLOAD -> GGML_OP_OFFLOAD_HEURISTIC_MIN to reflect that it is now generic across backends
  • Implemented the runtime env var GGML_OP_OFFLOAD_MIN_BATCH to allow the user to skip the heuristic and override the min batch size with a specified value; defaults to using the heuristic if not specified (see the sketch below)
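A minimal sketch of the intended runtime override, assuming a hypothetical helper wrapped around the existing threshold computation:

```cpp
#include <cstdlib>

// Sketch of the GGML_OP_OFFLOAD_MIN_BATCH override. The helper name and
// integration point are illustrative; the PR wires equivalent logic into
// each backend's offload check.
static int effective_min_batch(int heuristic_value) {
    const char * env = std::getenv("GGML_OP_OFFLOAD_MIN_BATCH");
    if (env != nullptr) {
        return std::atoi(env);   // user override skips the MoE heuristic entirely
    }
    return heuristic_value;      // otherwise fall back to the heuristic result
}
```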

TODO:

  • SYCL was previously skipping op offload for GGML_OP_MUL_MAT_ID, while this doesn't appear to be the case in mainline llama.cpp, and Vulkan had special handling for GGML_OP_MUL_MAT_ID to use ne[2] instead of ne[1] for the comparison. I'm not entirely sure why. If these changes are problematic, I'm open to reverting the addition of the MoE min batch heuristic to the other backends and simplifying this PR to just allowing the user to override the min batch size with GGML_OP_OFFLOAD_MIN_BATCH.

* Rename GGML_CUDA_MIN_BATCH_OFFLOAD → GGML_OP_OFFLOAD_HEURISTIC_MIN to make it generic
* Ported MoE op offload batch heuristic to SYCL, Vulkan, CANN
* Added GGML_OP_OFFLOAD_MIN_BATCH runtime env var as with llama.cpp #18535, which allows the user to set an explicit op offload threshold
@ikawrakow
Owner

Thank you for the PR.

Do you know how the op offload strategy in ik_llama.cpp works? This PR breaks it for MoE models.

The CANN and SYCL back-ends are not functional, so no need to make any changes there. I should remove them one of these days.

The Vulkan back-end has not been updated for a long time, so my guess is that llama.cpp is a better choice than ik_llama.cpp for Vulkan users at this point.

Hence, effectively, the only back-end that requires control over the minimum batch size is CUDA. The PR here that allows setting the minimum offload batch size at compile time precedes your mainline PR by about six months (see #520). Your PR breaks it for everyone who has been using GGML_CUDA_MIN_BATCH_OFFLOAD in their build scripts. I guess you decided to change the name of the compile-time variable to remove the reference to CUDA, so it can be used for all other back-ends. But I'm not sure this makes sense. The minimum meaningful offload batch size is definitely back-end dependent: in my testing, on the same hardware, the Vulkan back-end is significantly slower than the CUDA back-end, so given the same CPU, the optimum minimum offload batch size will be higher. Hence, to me, GGML_CUDA_MIN_BATCH_OFFLOAD makes perfect sense, and GGML_OP_OFFLOAD_HEURISTIC_MIN does not. If one wanted to extend this functionality to other back-ends, it would be, for instance, GGML_VULKAN_MIN_BATCH_OFFLOAD rather than GGML_OP_OFFLOAD_HEURISTIC_MIN, especially if one of these days I finally decide to make it possible to enable the CUDA and Vulkan back-ends in the same build.

I can't say that I'm a big fan of using environment variables for controlling how the app behaves. I know in mainline land they just love doing that. But in my book, the only time the usage of environment variables makes sense is in the case of a library that has not provided an API to set various parameters. That is not the case here. Besides, I can just add -cuda offload-batch-size=12 instead of the, for my taste, clunky GGML_OP_OFFLOAD_HEURISTIC_MIN=12 ./bin/llama-something.

@DocShotgun
Author

DocShotgun commented Feb 2, 2026

If the other backends are not to be used at the moment, then it makes sense to keep it to CUDA only.

Regarding offload-batch-size, perhaps I'm misunderstanding the intent - are we meant to only adjust the pre-heuristic value for MoE models for CUDA?

Comment thread on ggml/src/ggml-cuda.cu

```cpp
auto ctx = (const ggml_backend_cuda_context *)backend->context;
int min_batch_size = ctx->offload_batch_size; // originally: GGML_CUDA_MIN_BATCH_OFFLOAD

if (ctx->op_offload_min_batch_size >= 0) {
```
Owner


If, for whatever reason, somebody decided to use this new way of determining minimum offload batch size, it overrides the existing functionality for MoE models below.

Also, the batch size for MoE-related ops (GGML_OP_MUL_MAT_ID and GGML_OP_MOE_FUSED_UP_GATE) is not given by op->ne[1]; it is op->ne[2] instead.
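An illustrative sketch of that distinction (not the repository's actual code; note that GGML_OP_MOE_FUSED_UP_GATE is an ik_llama.cpp-specific op and does not exist in mainline ggml):

```cpp
#include "ggml.h"

// For MoE matrix multiplications the token/batch dimension lives in ne[2],
// while for ordinary ops it is ne[1].
static int64_t op_batch_size(const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_MUL_MAT_ID:
        case GGML_OP_MOE_FUSED_UP_GATE:   // ik_llama.cpp-specific fused MoE op
            return op->ne[2];             // MoE ops: batch is ne[2]
        default:
            return op->ne[1];             // dense ops: batch is ne[1]
    }
}
```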

Author

@DocShotgun commented Feb 2, 2026


The intention is to allow the user to override the heuristic with a raw minimum batch size value, so we skip the MoE calculation entirely if the user chooses to do so. If this is misguided in your view, then feel free to close this PR, as that is the main purpose of it.

Correct me if I'm misunderstanding here, but I believe the value exposed for the user to adjust should be the final batch size compared with the size of the input, not the min_batch_size that subsequently undergoes the MoE adjustment min_batch_size * total_experts / active_experts. If we bench speeds with op offload forced on/off to find the real break-even point on a particular model/hardware combination, the empirically determined value should not be adjusted any further for MoE (see the sketch below). offload-batch-size could be adjusted to allow for this as well, without having to introduce a new env var.
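A minimal sketch contrasting the two semantics being debated (function and parameter names are hypothetical, not actual ik_llama.cpp identifiers; assumes active > 0):

```cpp
// (a) This PR's intent: a user override bypasses the MoE heuristic entirely,
//     so the user sets the final threshold compared against the batch size.
int threshold_override(int user_override, int base, int total, int active) {
    return user_override >= 0 ? user_override
                              : base * total / active;
}

// (b) The alternative: the user adjusts only the pre-heuristic base value,
//     and the MoE scaling is still applied on top of it.
int threshold_base(int user_base, int total, int active) {
    return user_base * total / active;
}
```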

@DocShotgun
Author

Closing this; I will advise CPU+CUDA users to use the -cuda offload-batch-size=N arg that ikawrakow suggests, noting that for MoE models this will be further adjusted to N * total_experts / active_experts.
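For example (illustrative numbers), with N = 32 on a model with 128 total experts and 8 active experts per token, the effective offload threshold becomes 32 * 128 / 8 = 512 tokens.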

@DocShotgun closed this Feb 3, 2026
@DocShotgun
Author

Apologies for the misguided PR. I found #910, which documents your intended way of specifying this parameter. Unfortunately, there are lots of new features and no centralized resource covering niche settings like this. None of my friends who are avid CPU+GPU MoE offloaders were aware this CLI argument existed either, so I will direct them to use it lol.
