Llama 2 7B output differs from Hugging Face #746

@galopyz

Description

Bug description

Hello, I was following the ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb notebook, and I found that the output from the weight-loaded Llama 2 7B model differed from Hugging Face's version. I used greedy decoding to generate the response with both the notebook code and Hugging Face. To confirm I was using the correct decoding settings on the Hugging Face side, I tried the GPT-2 model, and there the Hugging Face and notebook outputs matched.
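For reference, the greedy decoding used in both runs can be sketched as a simple argmax loop. This is a minimal sketch, not the notebook's or Hugging Face's actual implementation; the `next_token_logits` callable is a hypothetical stand-in for a model forward pass:

```python
def greedy_generate(next_token_logits, token_ids, max_new_tokens):
    """Greedy decoding: at each step, append the highest-scoring token.

    next_token_logits: hypothetical callable mapping the current token-id
    list to a list of next-token logits (stand-in for a model forward pass).
    """
    for _ in range(max_new_tokens):
        logits = next_token_logits(token_ids)
        # deterministic argmax: no temperature, no top-k/top-p sampling
        token_ids = token_ids + [max(range(len(logits)), key=logits.__getitem__)]
    return token_ids

# toy "model" over a 4-token vocabulary that always ranks token 2 highest
demo = greedy_generate(lambda ids: [0.1, 0.5, 2.0, 1.0], [3], 4)
print(demo)  # [3, 2, 2, 2, 2]
```

Because there is no sampling, two implementations should produce identical outputs as long as their logits lead to the same argmax at every step.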

Here is a comparison of the outputs using the same seed and greedy decoding.
Because the Hugging Face tokenizer prepends a beginning-of-sequence (BOS) token for Llama 2 models, I added it as well and generated the following:

 Every effort has been made to ensure that the information contained in this website is accurate and up to date and correct at the time of publication

Here is the Hugging Face output with a few more tokens:

Every effort has been made to ensure that the information contained in this website is accurate and up to date. However, the information is provided without any warranty, express or implied, as to the accuracy

The two outputs match up to a certain point and then diverge.
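One plausible explanation for this pattern (my assumption; the issue itself does not identify the cause) is that small floating-point differences between two implementations can flip the argmax when the top two logits are nearly tied; once a single token differs, every later step conditions on a different context, so the outputs diverge from that point on even under greedy decoding. A minimal sketch with made-up logit values:

```python
# Two nearly identical logit vectors, e.g. from the same model run through
# two implementations with slightly different floating-point accumulation.
logits_a = [4.70000, 4.70001, 1.2]
logits_b = [4.70001, 4.70000, 1.2]  # top two values differ by ~1e-5

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

print(argmax(logits_a), argmax(logits_b))  # 1 0 -- a one-token flip
```

A single flip like this is enough to explain "identical prefix, then divergence": both models may be numerically correct while still disagreeing on a near-tie.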

What operating system are you using?

Linux

Where do you run your code?

Local (laptop, desktop)

Environment

[OK] Your Python version is 3.12.9
2025-07-22 00:22:31.985392: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1753161752.316201    7968 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1753161752.409691    7968 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1753161753.247778    7968 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-07-22 00:22:33.321338: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[OK] torch 2.7.0+cu126
[OK] jupyterlab 4.3.5
[OK] tiktoken 0.9.0
[OK] matplotlib 3.10.1
[OK] tensorflow 2.19.0
[OK] tqdm 4.67.1
[FAIL] numpy 2.2.6, please install a version matching <2.1,>=1.26
[OK] pandas 2.2.3
[OK] psutil 6.1.1



Labels

bug (Something isn't working)
