
Misc. bug: llama_sampler_penalties_apply is very inefficient #18556

@sliedes

Description

Name and Version

$ llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
version: 7527 (7f459c9)
built with GNU 14.3.0 for Linux x86_64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

libllama (core library)

Command line

CTX=8192; NP=2; llama-server --hf-repo ggml-org/gpt-oss-20b-GGUF --ctx-size $((CTX*NP)) -ngl 999 --repeat-penalty 1.2 --kv-unified

Problem description & steps to reproduce

Against my expectations, I observed llama.cpp running on CUDA become significantly slower when there was some CPU load on my server (a build job running in the background). Investigating the issue, it turned out that llama_sampler_penalties_apply is a significant CPU hot spot, to the extent that it was the bottleneck for the entire inference process: it had trouble keeping my GPU busy even when the machine was otherwise idle.

Looking at the code, this is probably an issue particularly for large-vocabulary models like gpt-oss-20b:

static void llama_sampler_penalties_apply(struct llama_sampler * smpl, llama_token_data_array * cur_p) {

What happens is that the code does a std::unordered_map lookup for each of the ~200k vocabulary candidates in the inner loop, once per output token. This turns out to take significant time.

I quickly hacked together the code below to see if it could be made faster. I don't understand the code well enough to know whether it's correct (generally, cur_p->sorted was false, but I assume that refers to being sorted by logprobs?), but at a quick look the output seemed OK:

static void llama_sampler_penalties_apply(struct llama_sampler * smpl, llama_token_data_array * cur_p) {
    auto * ctx = (llama_sampler_penalties *) smpl->ctx;

    if ((ctx->penalty_last_n == 0) ||
        (ctx->penalty_repeat == 1.0f && ctx->penalty_freq == 0.0f && ctx->penalty_present == 0.0f)) {
        return;
    }

    // START OPTIMIZED CODE PATH
    // Heuristic: if the first and last entries satisfy id == index, assume the
    // array is still in raw "logit order" (index == token id) and is directly
    // indexable. This allows indexing by token id instead of doing a hash
    // lookup for every candidate.

    // Check if array is likely in raw "logit order" (index == id)
    bool is_indexable = (cur_p->size > 0 && cur_p->data[0].id == 0);

    // Extra safety check: verify the last element matches to be sure no one shuffled it
    if (is_indexable && cur_p->size > 1) {
        if (cur_p->data[cur_p->size - 1].id != (int)(cur_p->size - 1)) {
            is_indexable = false;
        }
    }

    if (is_indexable) {
        for (const auto & entry : ctx->token_count) {
            const llama_token id = entry.first;
            const int count = entry.second;

            // Boundary check (just in case)
            if (id < 0 || (size_t)id >= cur_p->size) continue;

            // Direct pointer access
            llama_token_data * token_data = &cur_p->data[id];

            // Apply penalties
            if (token_data->logit <= 0) {
                token_data->logit *= ctx->penalty_repeat;
            } else {
                token_data->logit /= ctx->penalty_repeat;
            }
            token_data->logit -= float(count) * ctx->penalty_freq + float(count > 0) * ctx->penalty_present;
        }

        cur_p->sorted = false;
        return;
    } else {
        LLAMA_LOG_WARN("llama_sampler_penalties_apply: cur_p is not indexable, falling back to slow path\n");
    }
    // END OPTIMIZED CODE PATH

    // Apply frequency and presence penalties to the cur_p, slow path
    for (size_t i = 0; i < cur_p->size; ++i) {
        const auto token_iter = ctx->token_count.find(cur_p->data[i].id);
        if (token_iter == ctx->token_count.end()) {
            continue;
        }

        const int count = token_iter->second;

        assert(count > 0 && count <= ctx->penalty_last_n);

        // The academic publication that described this technique only divided, but that
        // would cause tokens with negative logits to become more likely, which is
        // obviously wrong. The common fix for this problem is to multiply by the
        // penalty instead of dividing.
        if (cur_p->data[i].logit <= 0) {
            cur_p->data[i].logit *= ctx->penalty_repeat;
        } else {
            cur_p->data[i].logit /= ctx->penalty_repeat;
        }

        cur_p->data[i].logit -= float(count) * ctx->penalty_freq + float(count > 0) * ctx->penalty_present;
    }

    cur_p->sorted = false;
}
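For what it's worth, the indexability check above can be factored into a standalone predicate (toy types and naming are mine, not llama.cpp's). Note that it only inspects the first and last entries, so it deliberately trades completeness for speed: a permutation confined to the middle of the array would still pass it, which is why the "just in case" boundary check in the fast path matters:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy stand-in for llama_token_data.
struct Tok { int id; float logit; };

// Endpoint heuristic from the patch: the array is treated as directly
// indexable when the first and last entries satisfy id == index.
static bool is_indexable(const std::vector<Tok> & cand) {
    if (cand.empty() || cand[0].id != 0) {
        return false;
    }
    if (cand.size() > 1 && cand.back().id != (int)(cand.size() - 1)) {
        return false;
    }
    return true;
}
```

A full O(N) verification of id == index for every entry would defeat the purpose of the optimization, so the endpoint check plus per-access bounds checking seems like a reasonable compromise, assuming nothing upstream reorders only the interior of the array.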

First Bad Commit

No response

Relevant log output

