Name and Version
$ llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
version: 7527 (7f459c9)
built with GNU 14.3.0 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
libllama (core library)
Command line
CTX=8192; NP=2; llama-server --hf-repo ggml-org/gpt-oss-20b-GGUF --ctx-size $((CTX*NP)) -ngl 999 --repeat-penalty 1.2 --kv-unified
Problem description & steps to reproduce
Against my expectations, I observed llama.cpp running on CUDA become significantly slower when there was some CPU load on my server (a build job running in the background). Investigating the issue, it turns out llama_sampler_penalties_apply is a significant CPU hot spot, to the extent that it was the bottleneck for the entire inference process (it had significant trouble keeping my GPU busy even when the machine was idle).
Looking at the code, this is probably an issue particularly for large-vocabulary models like gpt-oss-20b:
llama.cpp/src/llama-sampling.cpp, line 1776 (commit 18ddaea):

static void llama_sampler_penalties_apply(struct llama_sampler * smpl, llama_token_data_array * cur_p) {
What happens is that the code does a std::unordered_map access for each of the 200k candidates in the inner loop, once per output token. This turns out to take some time.
I quickly hacked up something like the following to see if it could be made faster, but I don't understand the code well enough to know whether it's correct (in my runs cur_p->sorted was generally false, but I assume that refers to being sorted by logprobs?). From a quick look, the output seemed OK:
static void llama_sampler_penalties_apply(struct llama_sampler * smpl, llama_token_data_array * cur_p) {
    auto * ctx = (llama_sampler_penalties *) smpl->ctx;

    if ((ctx->penalty_last_n == 0) ||
        (ctx->penalty_repeat == 1.0f && ctx->penalty_freq == 0.0f && ctx->penalty_present == 0.0f)) {
        return;
    }

    // START OPTIMIZED CODE PATH
    // Heuristic: if the first and last entries' token IDs match their array indices,
    // we assume the array is "indexable" (index == token ID). This lets us iterate the
    // small penalty map with O(1) indexing instead of doing a hash lookup per candidate.

    // Check if the array is likely in raw "logit order" (index == ID)
    bool is_indexable = (cur_p->size > 0 && cur_p->data[0].id == 0);

    // Extra safety check: verify the last element matches to be sure no one shuffled it
    if (is_indexable && cur_p->size > 1) {
        if (cur_p->data[cur_p->size - 1].id != (int)(cur_p->size - 1)) {
            is_indexable = false;
        }
    }

    if (is_indexable) {
        for (const auto & entry : ctx->token_count) {
            const llama_token id = entry.first;
            const int count      = entry.second;

            // boundary check (just in case)
            if (id < 0 || (size_t) id >= cur_p->size) {
                continue;
            }

            // direct pointer access
            llama_token_data * token_data = &cur_p->data[id];

            // apply penalties
            if (token_data->logit <= 0) {
                token_data->logit *= ctx->penalty_repeat;
            } else {
                token_data->logit /= ctx->penalty_repeat;
            }

            token_data->logit -= float(count) * ctx->penalty_freq + float(count > 0) * ctx->penalty_present;
        }

        cur_p->sorted = false;
        return;
    } else {
        LLAMA_LOG_WARN("llama_sampler_penalties_apply: cur_p is not indexable, falling back to slow path\n");
    }
    // END OPTIMIZED CODE PATH

    // Apply frequency and presence penalties to cur_p, slow path
    for (size_t i = 0; i < cur_p->size; ++i) {
        const auto token_iter = ctx->token_count.find(cur_p->data[i].id);
        if (token_iter == ctx->token_count.end()) {
            continue;
        }

        const int count = token_iter->second;

        assert(count > 0 && count <= ctx->penalty_last_n);

        // The academic publication that described this technique actually just divided,
        // but that would cause tokens with negative logits to become more likely,
        // which is obviously wrong. The common fix is to multiply by the penalty instead of dividing.
        if (cur_p->data[i].logit <= 0) {
            cur_p->data[i].logit *= ctx->penalty_repeat;
        } else {
            cur_p->data[i].logit /= ctx->penalty_repeat;
        }

        cur_p->data[i].logit -= float(count) * ctx->penalty_freq + float(count > 0) * ctx->penalty_present;
    }

    cur_p->sorted = false;
}
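To get a rough sense of the cost difference, here is a small standalone microbenchmark (my own sketch, not llama.cpp code; the vocabulary size and penalty window below are made-up assumptions) comparing one unordered_map lookup per candidate against iterating the small penalty map and indexing the array directly:

// Standalone sketch, not llama.cpp code: compares one hash lookup per
// vocabulary entry (current behaviour) with iterating the small penalty
// map and indexing the logit array directly (as in the hack above).
#include <chrono>
#include <cstdio>
#include <unordered_map>
#include <vector>

int main() {
    const size_t n_vocab = 200000; // assumed vocabulary size, roughly gpt-oss-20b scale
    const int    n_last  = 64;     // assumed penalty window (penalty_last_n)

    std::vector<float> logits(n_vocab, 0.0f);

    // pretend the last n_last generated tokens were these IDs
    std::unordered_map<int, int> token_count;
    for (int i = 0; i < n_last; ++i) {
        token_count[i * 1000] += 1;
    }

    using clk = std::chrono::steady_clock;

    // slow path: one unordered_map lookup per candidate
    auto t0 = clk::now();
    for (size_t i = 0; i < n_vocab; ++i) {
        auto it = token_count.find((int) i);
        if (it != token_count.end()) {
            logits[i] -= (float) it->second;
        }
    }
    auto t1 = clk::now();

    // fast path: iterate the small map, index the array directly
    for (const auto & entry : token_count) {
        logits[entry.first] -= (float) entry.second;
    }
    auto t2 = clk::now();

    auto us = [](clk::time_point a, clk::time_point b) {
        return (long long) std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    printf("per-candidate lookups: %lld us\n", us(t0, t1));
    printf("per-entry indexing:    %lld us\n", us(t1, t2));

    // keep the compiler from optimizing the work away entirely
    printf("checksum: %f\n", (double) logits[0]);
    return 0;
}

Since the real sampler runs this loop once per output token, even a modest per-call cost adds up over a generation.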
First Bad Commit
No response
Relevant log output