
Eval bug: Llama-3_1-Nemotron-51B ggufs generates incorrect answers/gibberish when prompt near or exceed 4K tokens #11002

@ymcki

Description

Name and Version

b4380

Operating systems

Linux

GGML backends

CUDA

Hardware

single 3090 + i7 4930K

Models

Llama-3_1-Nemotron-51B IQ3_S, IQ3_M, IQ4_XS, Q4_K_M from
https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/

Problem description & steps to reproduce

Providing a prompt that is close to 4K tokens or longer can cause the model to generate wrong output or gibberish. The same input to Qwen-2.5-Coder-32B.Q4_K_M.gguf gave me correct answers. Prompts shorter than 4K tokens seem to work fine for me.

A sample command to reproduce the problem:
./build/bin/llama-cli -m ~/Llama-3_1-Nemotron-51B-Instruct-GGUF/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_M.gguf -p 'You are a helpful AI assistant.' -f prompt.txt -c 15156 -cnv -ngl 70
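To check whether a given prompt file falls in the problematic length range, here is a rough sketch assuming the common ~4 characters-per-token heuristic for English text (the exact count depends on the model's tokenizer; llama.cpp's own `llama-tokenize` tool gives the precise number, and the 3800-token threshold below is an arbitrary safety margin for illustration):

```python
# Rough estimate of whether a prompt file is near the 4K-token range.
# Assumes ~4 characters per token, a common heuristic for English text;
# the exact count depends on the model's tokenizer.
def estimate_tokens(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        text = f.read()
    return len(text) // 4

def near_4k(path: str, threshold: int = 3800) -> bool:
    """True if the prompt is estimated to be near or above 4K tokens."""
    return estimate_tokens(path) >= threshold
```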

First Bad Commit

It happens in b4380; I have not identified the first bad commit. Does anyone know what usually causes this kind of issue, so that I can try to fix the bug myself?

Relevant log output

This is a typical bad reply from llama-cli when asked to list the top 10 most interesting LLM papers based on their titles:
---------
I ranked the papers based on how interesting their titles and abstracts sound. Here are the top ten most interesting sounding papers:

1. **A Survey on Model Compression for Large Language Models**
2. **A Survey on Transformer Compression**
3. **Survey on Transformer Compression**
4. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
5. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
6. **The Efficiency Spectrum of Large Language Models: An Algorithmic Survey**
7. **The Efficiency Spectrum of Large Language Models: An Algorithmic Survey**
8. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
9. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
10. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
11. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
12. **The Cost of Compression: Investigating the Impact of Compression on Parametric
