Vision RoPE refactor#46542
Closed
srishtiii28 wants to merge 12 commits into
Closed
Conversation
Contributor
|
[For maintainers] Suggested jobs to run (before merge) run-slow: ernie4_5_vl_moe, glm4v, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_vl, video_llama_3 |
Member
|
Being handled internally I think! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #46443
VisionRotaryEmbedding.forward()across vision-language models was returning raw frequency tensors which was leaving the caller responsible for finalising the positional embedding viatorch.cat((freqs, freqs), dim=-1).cos() / .sin(). This PR makesforward()return (cos, sin)directly which would make it consistent with how text RoPE modules already work.Changes:
Base class (qwen2_vl):
VisionRotaryEmbedding.forward(position_ids)now returnstuple[torch.Tensor, torch.Tensor]. Thetorch.cat + .cos() / .sin()logic is moved into the class.Models updated:
reshape(seq_len, -1)on an already 2D tensorWhat is left out at the moment:
qwen2_5_omni - its vision attention uses a different interface entirely.
VisionRotaryEmbedding.forward(seqlen: int)returns raw freqs that are then indexed by position. Changing this requires restructuring the vision attention, not just the RoPE class. I can maybe work on this if the reviewers give me a green signal.mlcd - the caller prepends a learned
class_pos_embparameter to raw freqs before applying cos/sin. To makeforward()return the final embedding including the class token,class_pos_embwould need to move into the RoPE class, which requires a weight rename in the conversion script. Again I can work on this if given a green signal.llama4 - uses complex-domain RoPE (
torch.view_as_complex) which has a different architecture altogether. Not the same pattern at all.Secondary/derived models (exaone4_5, glm4v_moe, qwen3_5_moe, qwen3_vl_moe, qwen3_omni_moe) - these derive from the primary models we fixed. Their modular files have pass for the VisionRotaryEmbedding class and no override of VisionTransformer.forward.
The modular converter has a global name registry bug:
qwen2_5_omni/modular_qwen2_5_omni.pydefines classQwen2_5_VisionRotaryEmbedding(Qwen2_5_VisionRotaryEmbedding)which pollutes the registry and causes derived classes in other models to pick up the seqlen: int forward instead of the position_ids one. Running the converter for these secondary models in the same batch produces inconsistent output for example: new RoPE API in the class, old cat+cos/sin pattern at the call site. Their generated files are currently self-consistent with the old API throughout, so I left them at HEAD.make fix-repo- why i was not able to run it:make fix-repo regenerated ALL modular models, reformated every file, synced doc TOCs, updated docstrings etc. Running it in full produced 200+ changed files across unrelated models which were mostly harmless formatter noise but also a few cases where the converter introduced bugs
Tests
The models changed in this PR don't have standalone test suites that run by default and most vision model tests are marked slow. The change is mechanical and provably correct: the (cos, sin) output from forward() is identical to what the call site was computing before which is torch.cat((freqs, freqs), dim=-1).cos() / .sin() which has just moved inside the class now.
I confirm that this is not a pure code agent PR.