Llama 2 7B output differs from Hugging Face #746

@galopyz

Description

Bug description

Hello, I was following the ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb notebook, and I found that the output from the weight-loaded Llama 2 7B model differed from Hugging Face's version. I used greedy decoding to generate the response with both the notebook code and Hugging Face. To confirm I was using the correct decoding settings on the Hugging Face side, I tried the GPT-2 model, and there the Hugging Face and notebook outputs matched.
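For reference, the greedy decoding used in both runs can be sketched as a simple argmax loop. This is a minimal sketch, not the notebook's or Hugging Face's actual implementation; the `next_token_logits` callable is a hypothetical stand-in for a model forward pass:

```python
def greedy_generate(next_token_logits, token_ids, max_new_tokens):
    """Greedy decoding: at each step, append the highest-scoring token.

    next_token_logits: hypothetical callable mapping the current token-id
    list to a list of next-token logits (stand-in for a model forward pass).
    """
    for _ in range(max_new_tokens):
        logits = next_token_logits(token_ids)
        # deterministic argmax: no temperature, no top-k/top-p sampling
        token_ids = token_ids + [max(range(len(logits)), key=logits.__getitem__)]
    return token_ids

# toy "model" over a 4-token vocabulary that always ranks token 2 highest
demo = greedy_generate(lambda ids: [0.1, 0.5, 2.0, 1.0], [3], 4)
print(demo)  # [3, 2, 2, 2, 2]
```

Because there is no sampling, two implementations should produce identical outputs as long as their logits lead to the same argmax at every step.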

Here is a comparison of the outputs using the same seed and greedy decoding.
Because the Hugging Face tokenizer prepends a beginning-of-sequence (BOS) token for Llama 2 models, I added it as well and generated the following:

 Every effort has been made to ensure that the information contained in this website is accurate and up to date and correct at the time of publication

Here is the Hugging Face output with a few more tokens:

Every effort has been made to ensure that the information contained in this website is accurate and up to date. However, the information is provided without any warranty, express or implied, as to the accuracy

The two outputs match up to a certain point and then diverge.
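One plausible explanation for this pattern (my assumption; the issue itself does not identify the cause) is that small floating-point differences between two implementations can flip the argmax when the top two logits are nearly tied; once a single token differs, every later step conditions on a different context, so the outputs diverge from that point on even under greedy decoding. A minimal sketch with made-up logit values:

```python
# Two nearly identical logit vectors, e.g. from the same model run through
# two implementations with slightly different floating-point accumulation.
logits_a = [4.70000, 4.70001, 1.2]
logits_b = [4.70001, 4.70000, 1.2]  # top two values differ by ~1e-5

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

print(argmax(logits_a), argmax(logits_b))  # 1 0 -- a one-token flip
```

A single flip like this is enough to explain "identical prefix, then divergence": both models may be numerically correct while still disagreeing on a near-tie.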

What operating system are you using?

Linux

Where do you run your code?

Local (laptop, desktop)

Environment

[OK] Your Python version is 3.12.9
2025-07-22 00:22:31.985392: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1753161752.316201    7968 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1753161752.409691    7968 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1753161753.247778    7968 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-07-22 00:22:33.321338: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[OK] torch 2.7.0+cu126
[OK] jupyterlab 4.3.5
[OK] tiktoken 0.9.0
[OK] matplotlib 3.10.1
[OK] tensorflow 2.19.0
[OK] tqdm 4.67.1
[FAIL] numpy 2.2.6, please install a version matching <2.1,>=1.26
[OK] pandas 2.2.3
[OK] psutil 6.1.1



Labels

bug (Something isn't working)
