
Eval bug: Llama-3_1-Nemotron-51B ggufs generates incorrect answers/gibberish when prompt near or exceed 4K tokens #11002

@ymcki

Description

Name and Version

b4380

Operating systems

Linux

GGML backends

CUDA

Hardware

single 3090 + i7 4930K

Models

Llama-3_1-Nemotron-51B IQ3_S, IQ3_M, IQ4_XS, Q4_K_M from
https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/

Problem description & steps to reproduce

Providing a prompt that is close to 4K tokens or longer can cause the model to generate wrong output or gibberish. The same input to Qwen-2.5-Coder-32B.Q4_K_M.gguf gave me correct answers. Prompts shorter than 4K tokens seem to work fine for me.

A sample command to reproduce the problem:
./build/bin/llama-cli -m ~/Llama-3_1-Nemotron-51B-Instruct-GGUF/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_M.gguf -p 'You are a helpful AI assistant.' -f prompt.txt -c 15156 -cnv -ngl 70
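To check whether a given prompt file falls in the problematic length range, here is a rough sketch assuming the common ~4 characters-per-token heuristic for English text (the exact count depends on the model's tokenizer; llama.cpp's own `llama-tokenize` tool gives the precise number, and the 3800-token threshold below is an arbitrary safety margin for illustration):

```python
# Rough estimate of whether a prompt file is near the 4K-token range.
# Assumes ~4 characters per token, a common heuristic for English text;
# the exact count depends on the model's tokenizer.
def estimate_tokens(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        text = f.read()
    return len(text) // 4

def near_4k(path: str, threshold: int = 3800) -> bool:
    """True if the prompt is estimated to be near or above 4K tokens."""
    return estimate_tokens(path) >= threshold
```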

First Bad Commit

It happens in b4380; I have not identified the first bad commit. Does anyone know what usually causes this kind of issue, so that I can try to fix the bug myself?

Relevant log output

This is a typical bad reply from llama-cli when asked to list the top 10 most interesting LLM papers based on their titles:
---------
I ranked the papers based on how interesting their titles and abstracts sound. Here are the top ten most interesting sounding papers:

1. **A Survey on Model Compression for Large Language Models**
2. **A Survey on Transformer Compression**
3. **Survey on Transformer Compression**
4. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
5. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
6. **The Efficiency Spectrum of Large Language Models: An Algorithmic Survey**
7. **The Efficiency Spectrum of Large Language Models: An Algorithmic Survey**
8. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
9. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
10. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
11. **The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models**
12. **The Cost of Compression: Investigating the Impact of Compression on Parametric
