Fix RoPE implementation #747
Conversation
I wonder if the origin of the problem isn't, more broadly, the combination of how the weights are loaded and which RoPE variant is used, as explained in this issue: huggingface/transformers#25199. Sebastian is loading the weights directly, whereas HF permutes Q and K during conversion and then uses the 2-halves variant. That could explain why switching to your interleaved variant is the right one to use here, iiuc. |
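For reference, here is a minimal sketch (my own illustration, not code from either library) contrasting the two RoPE layouts in question; `angles` is assumed to hold the per-position rotation angles with shape `(seq_len, head_dim // 2)`, and `x` has shape `(..., seq_len, head_dim)`:

```python
import torch

def rope_two_halves(x, angles):
    # "Split in two halves" layout (what HF uses after permuting Q/K at conversion time):
    # dimension i is rotated together with dimension i + head_dim // 2.
    cos = torch.cat([angles.cos(), angles.cos()], dim=-1)
    sin = torch.cat([angles.sin(), angles.sin()], dim=-1)
    x1, x2 = x.chunk(2, dim=-1)
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos + rotated * sin

def rope_interleaved(x, angles):
    # Interleaved layout (original Meta checkpoint / torchtune style):
    # dimension 2i is rotated together with dimension 2i + 1.
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x[..., 0::2] * cos - x[..., 1::2] * sin
    out[..., 1::2] = x[..., 0::2] * sin + x[..., 1::2] * cos
    return out
```

The two layouts only give matching results if the projection weights are reordered accordingly, which, as I understand it, is what the Q/K permutation in the HF conversion compensates for.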
Thanks for the PR. I agree that there could be a bug. What's weird, though, is that the unit test comparing the RoPE calculations to 2 reference implementations (LitGPT and HF transformers) gave the same results. Maybe there was an edge case. It looks like there is now some issue with the tests after the fix:
Could you update the PR? |
Sorry, I was not aware of the tests. After going through them, I found that the LitGPT implementation matches the Hugging Face implementation. However, neither matches the torchtune or Llama 2 implementation. Here is a Google Colab notebook with the comparisons. I am not sure why they do not match, but using torchtune's RotaryPositionalEmbeddings gave me the same output as Hugging Face. Would it be okay to change the test to compare the implementation against torchtune? |
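In case it helps, this is roughly the kind of cross-check the notebook runs (my own sketch; `rope_under_test` is a placeholder for whichever implementation is being compared, and I am assuming torchtune's documented `torchtune.modules.RotaryPositionalEmbeddings`, which expects inputs of shape `[batch, seq_len, num_heads, head_dim]`):

```python
import torch
from torchtune.modules import RotaryPositionalEmbeddings  # assumes torchtune is installed

torch.manual_seed(123)
b, s, n_h, h_d = 1, 16, 4, 64
x = torch.randn(b, s, n_h, h_d)

tt_rope = RotaryPositionalEmbeddings(dim=h_d, max_seq_len=s, base=10_000)
tt_out = tt_rope(x)  # shape: [b, s, n_h, h_d]

# `rope_under_test` stands in for the implementation being checked; transpose if it
# expects [b, n_h, s, h_d] instead of [b, s, n_h, h_d].
my_out = rope_under_test(x.transpose(1, 2)).transpose(1, 2)
torch.testing.assert_close(my_out, tt_out)
```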
Thanks for looking into that! Honestly, I really appreciate your time here fixing the RoPE issues. I remember spending a lot of time debugging things back then... I haven't had time to carefully double-check the reference implementations this morning (the last time I checked was about a year ago when I wrote the original Llama code here), so I may be missing something or not understanding it correctly yet. But that being said, regarding
I think the reason is that the LitGPT implementation is a general-purpose implementation (developed before torchtune) that works with all kinds of LLMs, not just Llama. In fact, torchtune copied many aspects from LitGPT, but they may have implemented the RoPE in their own way. (By "copied" I mean that LitGPT was around first, and torchtune was developed 1-2 years later trying to mimic the LitGPT API; you can see it when searching for …) So maybe torchtune has a correct implementation here whereas the LitGPT project doesn't.
So if I understand correctly,
That part I find a bit confusing, because how can the torchtune RoPE match the Hugging Face one but not the LitGPT one, even though LitGPT and Hugging Face both match the RoPE in this repository in the tests?
That would be okay with me, but I think we always need a 2nd reference here like Hugging Face to ensure consensus. |
Sorry about the confusion. I changed the access settings on the Colab notebook so you can read it. Here is a summary of the results from the notebook.
Regarding
What I meant to say is that by changing the RoPE implementation to match torchtune, the output from the model (the generated text) matches Hugging Face's generated text. I should have been more explicit about what kind of output I was referring to. |
Thanks for clarifying, I think I understand now. |
As @casinca mentioned, Hugging Face uses … Llama 2 7B does not use grouped-query attention, so we can rule out that case. I have tried loading weights from … It could be possible that LitGPT uses the Hugging Face transformers version of the model weights. |
Ohhh, I see now. Yes, LitGPT uses the Hugging Face weights. But in this case, could we not just permute the queries and keys instead of swapping the RoPE? Similar to what Hugging Face did:

```python
n_heads = LLAMA2_CONFIG_7B["n_heads"]
dim = LLAMA2_CONFIG_7B["emb_dim"]


def permute(w, n_heads=n_heads, dim1=dim, dim2=dim):
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)


def assign(left, right):
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")

    if isinstance(right, torch.Tensor):
        return torch.nn.Parameter(right.clone().detach())
    else:
        return torch.nn.Parameter(torch.tensor(right))


def load_weights_into_llama(model, param_config, params):
    model.tok_emb.weight = assign(model.tok_emb.weight, params["tok_embeddings.weight"])

    for l in range(param_config["n_layers"]):

        # Load attention weights
        model.trf_blocks[l].att.W_query.weight = assign(
            model.trf_blocks[l].att.W_query.weight,
            permute(params[f"layers.{l}.attention.wq.weight"])  # NEW
        )
        model.trf_blocks[l].att.W_key.weight = assign(
            model.trf_blocks[l].att.W_key.weight,
            permute(params[f"layers.{l}.attention.wk.weight"])  # NEW
        )
        model.trf_blocks[l].att.W_value.weight = assign(
            model.trf_blocks[l].att.W_value.weight,
            params[f"layers.{l}.attention.wv.weight"]
        )
        model.trf_blocks[l].att.out_proj.weight = assign(
            model.trf_blocks[l].att.out_proj.weight,
            params[f"layers.{l}.attention.wo.weight"]
        )
        model.trf_blocks[l].norm1.weight = assign(
            model.trf_blocks[l].norm1.weight,
            params[f"layers.{l}.attention_norm.weight"]
        )

        # Load FeedForward weights
        model.trf_blocks[l].ff.fc1.weight = assign(
            model.trf_blocks[l].ff.fc1.weight,
            params[f"layers.{l}.feed_forward.w1.weight"]
        )
        # For some reason w2 and w3 are provided in the wrong order in the weights file
        model.trf_blocks[l].ff.fc2.weight = assign(
            model.trf_blocks[l].ff.fc2.weight,
            params[f"layers.{l}.feed_forward.w3.weight"]
        )
        model.trf_blocks[l].ff.fc3.weight = assign(
            model.trf_blocks[l].ff.fc3.weight,
            params[f"layers.{l}.feed_forward.w2.weight"]
        )
        model.trf_blocks[l].norm2.weight = assign(
            model.trf_blocks[l].norm2.weight,
            params[f"layers.{l}.ffn_norm.weight"]
        )

    # Load output layer weights
    model.final_norm.weight = assign(model.final_norm.weight, params["norm.weight"])
    model.out_head.weight = assign(model.out_head.weight, params["output.weight"])


load_weights_into_llama(model, LLAMA2_CONFIG_7B, weights)
model.to(device);
```
|
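To see concretely what `permute()` changes, here is a small illustrative check of my own (reusing the `permute()` defined above): within each head, the even-indexed rows end up in the first half and the odd-indexed rows in the second half, which is the reordering that turns the interleaved pairing of the original weights into the two-halves pairing that the HF-style RoPE expects.

```python
import torch

# Fill row i of a single 8x8 "head" with the value i and see where permute() sends it.
head_dim = 8
w = torch.arange(head_dim).float().unsqueeze(1).repeat(1, head_dim)

permuted = permute(w, n_heads=1, dim1=head_dim, dim2=head_dim)
print(permuted[:, 0])
# tensor([0., 2., 4., 6., 1., 3., 5., 7.])  -> even-indexed rows first, then odd-indexed rows
```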
I just gave it a quick try, and it seems to work. I added it as a separate PR in #750 so you can check out the file diffs via ReviewNB. It looks like we are now getting almost identical results. Base model:
Note that the last word is different in 2 & 3. Chat model:
Still, there is a one-word difference in the base model. |
That worked nicely! The one-word difference might be from the Hugging Face tokenizer adding a 1 at the beginning of the sequence. Adding the 1 manually at the beginning and generating text resulted in exactly the same text as the Hugging Face transformers model. Here are the outputs with and without the BOS token vs. the Hugging Face model output. With 1 as the BOS token:
Without the BOS token:
HF output:
|
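For completeness, a minimal sketch of the manual BOS prepending described above; `tokenizer` and `generate` are placeholders standing in for the notebook's own helpers, and the prompt is just illustrative:

```python
import torch

prompt = "What do llamas eat?"         # illustrative prompt
token_ids = tokenizer.encode(prompt)   # SentencePiece-style encode returning a list of ids
token_ids = [1] + token_ids            # manually prepend Llama 2's BOS token (id 1),
                                       # mirroring what the HF tokenizer adds automatically

idx = torch.tensor(token_ids, device=device).unsqueeze(0)  # shape: (1, num_tokens)
output_ids = generate(model, idx, max_new_tokens=30)       # notebook's generate helper (assumed)
```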
Awesome. Glad that it all works correctly now. Thanks so much for the valuable discussion and contribution! (I will merge the other PR then; it's a bit easier this way than changing the RoPE code, and this way the RoPE code can be reused for Llama 3 etc.) |
That's great. I am glad we fixed this very subtle bug. I appreciate your feedback and discussions. I learned a lot. |
This is a PR to fix a bug in #746.
I found out that the output of RoPE did not match torchtune's RotaryPositionalEmbeddings or llama's RoPE.
After making changes to the RoPE implementation, the output from both Hugging Face and the notebook matched.
Here are the changes. Instead of dividing the head into a first half and a second half by indexing up to the halfway point, I indexed by even and odd positions. This is because the even indices are multiplied by cos and the odd ones by sin. Then, I apply the following formula:
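$$
\begin{aligned}
x'_{2i} &= x_{2i}\cos\theta_i - x_{2i+1}\sin\theta_i \\
x'_{2i+1} &= x_{2i}\sin\theta_i + x_{2i+1}\cos\theta_i
\end{aligned}
$$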
to apply the RoPE transformation.
I tried to follow your style of keeping cos and sin instead of converting them into complex numbers.
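To illustrate that equivalence, here is a standalone sketch of my own (not the PR code) computing the interleaved rotation once with complex numbers, as in Meta's original Llama code, and once with cos and sin:

```python
import torch

torch.manual_seed(0)
seq_len, head_dim = 2, 8
x = torch.randn(seq_len, head_dim)

inv_freq = 10_000 ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
angles = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1) * inv_freq  # (seq_len, head_dim // 2)

# Complex-number formulation: treat each (x_{2i}, x_{2i+1}) pair as one complex number
# and multiply by e^{i * angle}.
x_complex = torch.view_as_complex(x.reshape(seq_len, head_dim // 2, 2))
rot_complex = torch.view_as_real(
    x_complex * torch.polar(torch.ones_like(angles), angles)
).reshape(seq_len, head_dim)

# cos/sin formulation (the style kept here).
cos, sin = angles.cos(), angles.sin()
rot = torch.empty_like(x)
rot[..., 0::2] = x[..., 0::2] * cos - x[..., 1::2] * sin
rot[..., 1::2] = x[..., 0::2] * sin + x[..., 1::2] * cos

print(torch.allclose(rot, rot_complex, atol=1e-6))  # True
```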
I also found out that the instruction-finetuned model weights were not getting loaded, so I added that as well.
Please let me know if any parts are unclear or need changes. Thank you.