Name and Version
$ llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
version: 7527 (7f459c9)
built with GNU 14.3.0 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
libllama (core library)
Command line
CTX=8192; NP=2; llama-server --hf-repo ggml-org/gpt-oss-20b-GGUF --ctx-size $((CTX*NP)) -ngl 999 --repeat-penalty 1.2 --kv-unified
Problem description & steps to reproduce
Against my expectations, I observed llama.cpp running on CUDA become significantly slower when there was some CPU load on my server (a build job running in the background). Investigating the issue, it turns out llama_sampler_penalties_apply is a significant CPU hot spot, to the extent that it was the bottleneck for the entire inference process (it had significant trouble keeping my GPU busy even when the machine was idle).
Looking at the code, this is probably an issue particularly for large-vocabulary models like gpt-oss-20b:
llama.cpp/src/llama-sampling.cpp, line 1776 (commit 18ddaea):

static void llama_sampler_penalties_apply(struct llama_sampler * smpl, llama_token_data_array * cur_p) {
What happens is that the code does a std::unordered_map access for each of the 200k candidates in the inner loop, once per output token. This turns out to take some time.
I quickly hacked up something like the following to see if it could be made faster, but I don't understand the code well enough to know whether it's correct (in my runs cur_p->sorted was generally false, but I assume that refers to being sorted by logprobs?). From a quick look, the output seemed OK:
static void llama_sampler_penalties_apply(struct llama_sampler * smpl, llama_token_data_array * cur_p) {
    auto * ctx = (llama_sampler_penalties *) smpl->ctx;

    if ((ctx->penalty_last_n == 0) ||
        (ctx->penalty_repeat == 1.0f && ctx->penalty_freq == 0.0f && ctx->penalty_present == 0.0f)) {
        return;
    }

    // START OPTIMIZED CODE PATH
    // Heuristic: if the first and last entries' token IDs match their array indices,
    // we assume the array is "indexable" (index == token ID). This lets us iterate the
    // small penalty map with O(1) indexing instead of doing a hash lookup per candidate.

    // Check if the array is likely in raw "logit order" (index == ID)
    bool is_indexable = (cur_p->size > 0 && cur_p->data[0].id == 0);

    // Extra safety check: verify the last element matches to be sure no one shuffled it
    if (is_indexable && cur_p->size > 1) {
        if (cur_p->data[cur_p->size - 1].id != (int)(cur_p->size - 1)) {
            is_indexable = false;
        }
    }

    if (is_indexable) {
        for (const auto & entry : ctx->token_count) {
            const llama_token id = entry.first;
            const int count      = entry.second;

            // boundary check (just in case)
            if (id < 0 || (size_t) id >= cur_p->size) {
                continue;
            }

            // direct pointer access
            llama_token_data * token_data = &cur_p->data[id];

            // apply penalties
            if (token_data->logit <= 0) {
                token_data->logit *= ctx->penalty_repeat;
            } else {
                token_data->logit /= ctx->penalty_repeat;
            }

            token_data->logit -= float(count) * ctx->penalty_freq + float(count > 0) * ctx->penalty_present;
        }

        cur_p->sorted = false;
        return;
    } else {
        LLAMA_LOG_WARN("llama_sampler_penalties_apply: cur_p is not indexable, falling back to slow path\n");
    }
    // END OPTIMIZED CODE PATH

    // Apply frequency and presence penalties to cur_p, slow path
    for (size_t i = 0; i < cur_p->size; ++i) {
        const auto token_iter = ctx->token_count.find(cur_p->data[i].id);
        if (token_iter == ctx->token_count.end()) {
            continue;
        }

        const int count = token_iter->second;

        assert(count > 0 && count <= ctx->penalty_last_n);

        // The academic publication that described this technique actually just divided,
        // but that would cause tokens with negative logits to become more likely,
        // which is obviously wrong. The common fix is to multiply by the penalty instead of dividing.
        if (cur_p->data[i].logit <= 0) {
            cur_p->data[i].logit *= ctx->penalty_repeat;
        } else {
            cur_p->data[i].logit /= ctx->penalty_repeat;
        }

        cur_p->data[i].logit -= float(count) * ctx->penalty_freq + float(count > 0) * ctx->penalty_present;
    }

    cur_p->sorted = false;
}
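To get a rough sense of the cost difference, here is a small standalone microbenchmark (my own sketch, not llama.cpp code; the vocabulary size and penalty window below are made-up assumptions) comparing one unordered_map lookup per candidate against iterating the small penalty map and indexing the array directly:

// Standalone sketch, not llama.cpp code: compares one hash lookup per
// vocabulary entry (current behaviour) with iterating the small penalty
// map and indexing the logit array directly (as in the hack above).
#include <chrono>
#include <cstdio>
#include <unordered_map>
#include <vector>

int main() {
    const size_t n_vocab = 200000; // assumed vocabulary size, roughly gpt-oss-20b scale
    const int    n_last  = 64;     // assumed penalty window (penalty_last_n)

    std::vector<float> logits(n_vocab, 0.0f);

    // pretend the last n_last generated tokens were these IDs
    std::unordered_map<int, int> token_count;
    for (int i = 0; i < n_last; ++i) {
        token_count[i * 1000] += 1;
    }

    using clk = std::chrono::steady_clock;

    // slow path: one unordered_map lookup per candidate
    auto t0 = clk::now();
    for (size_t i = 0; i < n_vocab; ++i) {
        auto it = token_count.find((int) i);
        if (it != token_count.end()) {
            logits[i] -= (float) it->second;
        }
    }
    auto t1 = clk::now();

    // fast path: iterate the small map, index the array directly
    for (const auto & entry : token_count) {
        logits[entry.first] -= (float) entry.second;
    }
    auto t2 = clk::now();

    auto us = [](clk::time_point a, clk::time_point b) {
        return (long long) std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    printf("per-candidate lookups: %lld us\n", us(t0, t1));
    printf("per-entry indexing:    %lld us\n", us(t1, t2));

    // keep the compiler from optimizing the work away entirely
    printf("checksum: %f\n", (double) logits[0]);
    return 0;
}

Since the real sampler runs this loop once per output token, even a modest per-call cost adds up over a generation.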
First Bad Commit
No response
Relevant log output