Refactor batch size handling for offloading host operations to device #1214
DocShotgun wants to merge 2 commits into ikawrakow:main
Conversation
* Rename `GGML_CUDA_MIN_BATCH_OFFLOAD` → `GGML_OP_OFFLOAD_HEURISTIC_MIN` to make it generic
* Ported MoE op offload batch heuristic to SYCL, Vulkan, CANN
* Added `GGML_OP_OFFLOAD_MIN_BATCH` runtime env var as with llama.cpp #18535, which allows the user to set an explicit op offload threshold (see the sketch below)
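For context, the runtime env var override amounts to something like the following sketch; the helper name is hypothetical and this is not the PR's literal code:

```cpp
#include <cstdlib>  // std::getenv, std::atoi

// Hypothetical helper: read GGML_OP_OFFLOAD_MIN_BATCH once and cache it.
// Returns -1 when the variable is unset, meaning "fall back to the heuristic".
static int op_offload_min_batch_override() {
    static const int cached = [] {
        const char * s = std::getenv("GGML_OP_OFFLOAD_MIN_BATCH");
        return s ? std::atoi(s) : -1;
    }();
    return cached;
}
```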
Thank you for the PR. Do you know how the op offload strategy in […]?

The CANN and SYCL back-ends are not functional, so no need to make any changes there. I should remove them one of these days. The Vulkan back-end has not been updated for a long time, so my guess is that […]. Hence, effectively, the only back-end that requires control over the minimum batch size is CUDA.

The PR here that allows setting the minimum offload batch size at compile time precedes your mainline PR by about 6 months (see #520). Your PR breaks it for everyone who has been using `GGML_CUDA_MIN_BATCH_OFFLOAD`.

I can't say that I'm a big fan of using environment variables for controlling how the app behaves. I know in mainline land they just love doing that. But in my book, the only time the usage of environment variables makes sense is in the case of a library that has not provided an API to set various parameters. This is not the case here. Besides, I can just add […].
If the other backends are not to be used at the moment, then it makes sense to keep this to CUDA only. Regarding […]
```cpp
auto ctx = (const ggml_backend_cuda_context *)backend->context;
int min_batch_size = ctx->offload_batch_size; // originally: GGML_CUDA_MIN_BATCH_OFFLOAD
```
```cpp
if (ctx->op_offload_min_batch_size >= 0) {
```
If, for whatever reason, somebody decided to use this new way of determining the minimum offload batch size, it overrides the existing functionality for MoE models below.

Also, the batch size for MoE-related ops (`GGML_OP_MUL_MAT_ID` and `GGML_OP_MOE_FUSED_UP_GATE`) is not given by `op->ne[1]`; it is `op->ne[2]` instead.
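In other words, the check should look roughly like this (a sketch of the corrected comparison, not the actual ik_llama.cpp code; the helper name is illustrative):

```cpp
#include "ggml.h"

// Pick the tensor dimension that actually holds the batch size: for the MoE
// matrix-multiplication ops it is ne[2], for everything else it is ne[1].
static bool batch_large_enough_to_offload(const ggml_tensor * op, int64_t min_batch_size) {
    const bool is_moe_op = op->op == GGML_OP_MUL_MAT_ID ||
                           op->op == GGML_OP_MOE_FUSED_UP_GATE;
    const int64_t batch_size = is_moe_op ? op->ne[2] : op->ne[1];
    return batch_size >= min_batch_size;
}
```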
The intention is to allow the user to override the heuristic with a raw minimum batch size value, so we skip the MoE calculation entirely if the user chooses to do so. If this is misguided in your view, then feel free to close this PR, as that is its main purpose.

Correct me if I'm misunderstanding here, but I do believe that the value that should be exposed for the user to adjust is the final batch size to be compared with the size of the input, and not the `min_batch_size` that subsequently undergoes the MoE adjustment with `min_batch_size * total_experts / active_experts`. If we are to benchmark speeds with op offload forced on/off to find the real break-even point on a particular model/hardware combination, the value that is empirically determined should not be adjusted any further based on MoE. `offload-batch-size` could be adjusted to allow for this as well, without having to introduce a new env var.
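To make the distinction concrete, a rough sketch of the two possible semantics for the override (all names illustrative, not actual code):

```cpp
#include <cstdint>

// (a) Override feeds the heuristic *base*: the effective threshold still gets
//     scaled by the expert ratio, so an empirically tuned value drifts with
//     the model's expert configuration.
int64_t threshold_as_base(int64_t override_base, int64_t total_experts, int64_t active_experts) {
    return override_base * total_experts / active_experts;
}

// (b) Override is the *final* threshold (the semantics this PR intends): the
//     value is compared directly against the batch size, with no MoE scaling.
int64_t threshold_as_final(int64_t override_final) {
    return override_final;
}
```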
Closing this; will advise CPU+CUDA users to use the `offload-batch-size` option instead.
Apologies for the misguided PR. I found #910, which documented your intended way of specifying this parameter. Unfortunately there are lots of new features and no centralized resource covering all of the niche settings like this. None of my friends who are avid CPU+GPU MoE offloaders were aware of this CLI argument existing either, so I will direct them to use it lol.
This is an attempt to make the behavior for handling the minimum batch size for offloading host operations to device consistent between backends, as well as incorporate my changes from ggml-org/llama.cpp#18535 to allow manual overriding of the threshold by the user.
Current Behavior:
- In llama.cpp, the default threshold for offloading prompt processing to GPU is `32`. After ggml-org/llama.cpp#18535, we can override this value with the env var `GGML_OP_OFFLOAD_MIN_BATCH`.
- In ik_llama.cpp, the default threshold is a heuristic defined by `32 * total_experts / active_experts` (sketched below); however, this is only implemented for the CUDA backend. The other backends (SYCL, Vulkan, CANN) retain the previous llama.cpp behavior of a hardcoded `32`. Additionally, the compile-time definition `GGML_CUDA_MIN_BATCH_OFFLOAD` can be used to change the `32` value used in the heuristic formula.
- ik_llama.cpp does not appear to have the equivalent Metal op offload that mainline llama.cpp has.
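For reference, the CUDA-side heuristic roughly amounts to the following (a paraphrase under my reading, not the literal ik_llama.cpp source):

```cpp
#include <cstdint>

// Compile-time base threshold; GGML_CUDA_MIN_BATCH_OFFLOAD before this PR.
#ifndef GGML_OP_OFFLOAD_HEURISTIC_MIN
#define GGML_OP_OFFLOAD_HEURISTIC_MIN 32
#endif

// MoE models activate only a fraction of their experts per token, so the batch
// must be larger by the inverse of that fraction before offloading pays off.
static int64_t moe_min_batch_to_offload(int64_t total_experts, int64_t active_experts) {
    return GGML_OP_OFFLOAD_HEURISTIC_MIN * total_experts / active_experts;
}
```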
PR Changes:

- Renamed `GGML_CUDA_MIN_BATCH_OFFLOAD` → `GGML_OP_OFFLOAD_HEURISTIC_MIN` to reflect this now being generic across backends
- Added `GGML_OP_OFFLOAD_MIN_BATCH` to allow the user to skip the heuristic and override the min batch size to a specified value; defaults to using the heuristic if not specified

TODO:
- […] `GGML_OP_MUL_MAT_ID`, while this doesn't appear to be the case in main llama.cpp, and Vulkan had special handling for `GGML_OP_MUL_MAT_ID` to use `ne[2]` instead of `ne[1]` for the comparison. I'm not entirely sure why. If these changes are problematic, I'm open to reverting the addition of the MoE min batch heuristic to the other backends and simplifying this PR to just allowing the user to override the min batch size with `GGML_OP_OFFLOAD_MIN_BATCH`.