Vision RoPE refactor #46542

Closed
srishtiii28 wants to merge 12 commits into
huggingface:mainfrom
srishtiii28:vision-rope-refactor

srishtiii28 commented Jun 10, 2026

Closes #46443

VisionRotaryEmbedding.forward() across vision-language models returned raw frequency tensors, leaving the caller responsible for finalising the positional embedding via torch.cat((freqs, freqs), dim=-1).cos() / .sin(). This PR makes forward() return (cos, sin) directly, which is consistent with how text RoPE modules already work.

Changes:

Base class (qwen2_vl): VisionRotaryEmbedding.forward(position_ids) now returns tuple[torch.Tensor, torch.Tensor]. The torch.cat + .cos() / .sin() logic is moved into the class.
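
To make the interface change concrete, here is a minimal sketch of the old and new return contracts. It is not the transformers implementation: plain Python lists stand in for tensors so it runs without PyTorch, and the class name `VisionRotaryEmbeddingSketch`, the method `forward_old`, and the (h, w) row layout are assumptions made for illustration only.

```python
import math


class VisionRotaryEmbeddingSketch:
    """Toy, list-based stand-in for a vision RoPE module (hypothetical name)."""

    def __init__(self, dim: int, theta: float = 10000.0):
        # One inverse frequency per pair of channels, as in standard RoPE.
        self.inv_freq = [1.0 / (theta ** (i / dim)) for i in range(0, dim, 2)]

    def _freqs(self, position_ids):
        # position_ids: list of rows, one index per axis, e.g. (h, w).
        # Each output row concatenates pos * inv_freq over the axes.
        return [
            [pos * f for pos in row for f in self.inv_freq]
            for row in position_ids
        ]

    def forward_old(self, position_ids):
        # Old contract: raw frequencies; the caller finishes the embedding.
        return self._freqs(position_ids)

    def forward(self, position_ids):
        # New contract: duplicate along the last dim (the list analogue of
        # torch.cat((freqs, freqs), dim=-1)), then return cos and sin.
        emb = [row + row for row in self._freqs(position_ids)]
        cos = [[math.cos(v) for v in row] for row in emb]
        sin = [[math.sin(v) for v in row] for row in emb]
        return cos, sin


rope = VisionRotaryEmbeddingSketch(dim=4)
position_ids = [[0, 0], [0, 1], [1, 0]]  # hypothetical (h, w) grid positions

freqs = rope.forward_old(position_ids)   # what callers used to receive
cos, sin = rope.forward(position_ids)    # what callers receive after the change
```

With this shape of change, the call site drops its own cat + cos/sin step and consumes the tuple directly.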

Models updated:

  • qwen2_vl - base class
  • qwen2_5_vl - window_index is now applied to position_ids before the RoPE call; this is mathematically equivalent because RoPE is computed independently for each position row
  • qwen3_vl - also removed a no-op reshape(seq_len, -1) on an already 2D tensor
  • qwen3_5 - same no-op reshape removed
  • video_llama_3, ernie4_5_vl_moe, glm4v, glm_ocr, paddleocr_vl
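
The equivalence claimed for qwen2_5_vl can be checked with a small sketch. It assumes a row-wise embedding function and a per-row permutation; the actual model may reorder blocks of rows rather than single rows, but the argument is the same. `rope_rows`, `inv_freq` and the sample `window_index` are hypothetical names and values, with lists standing in for tensors.

```python
import math


def rope_rows(position_ids, inv_freq):
    # Row-wise embedding: each output row depends only on its own input row.
    return [
        [math.cos(pos * f) for pos in row for f in inv_freq]
        for row in position_ids
    ]


inv_freq = [1.0, 0.01]                            # hypothetical frequencies
position_ids = [[0, 0], [0, 1], [1, 0], [1, 1]]   # hypothetical (h, w) rows
window_index = [2, 0, 3, 1]                       # hypothetical reordering

# Previous order of operations: embed first, then reorder the rows.
embedded = rope_rows(position_ids, inv_freq)
reorder_after = [embedded[i] for i in window_index]

# Order used in this PR: reorder position_ids first, then embed.
reorder_before = rope_rows([position_ids[i] for i in window_index], inv_freq)
```

Both orders apply the same arithmetic to the same rows, so the results match exactly, not just approximately.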

What is left out at the moment:

qwen2_5_omni - its vision attention uses a different interface entirely. VisionRotaryEmbedding.forward(seqlen: int) returns raw freqs that are then indexed by position. Changing this requires restructuring the vision attention, not just the RoPE class. I can work on this if the reviewers give the go-ahead.
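
For context, the two interfaces can be contrasted in a short sketch. It is a list-based illustration and does not reproduce the qwen2_5_omni code; `freqs_table` and `freqs_from_positions` are hypothetical helper names. The raw frequencies come out the same either way; what differs is that the seqlen-style caller owns the indexing step, which is why the attention code would need restructuring.

```python
# Hypothetical inverse frequencies for a rotary dim of 4.
inv_freq = [1.0 / (10000.0 ** (i / 4)) for i in range(0, 4, 2)]


def freqs_table(seqlen):
    # seqlen-style interface: build a table whose row p holds p * inv_freq.
    return [[p * f for f in inv_freq] for p in range(seqlen)]


def freqs_from_positions(position_ids):
    # position_ids-style interface: compute each row from its positions.
    return [[pos * f for pos in row for f in inv_freq] for row in position_ids]


position_ids = [[0, 2], [1, 1], [2, 0]]  # hypothetical (h, w) rows

# The seqlen-style caller looks positions up in the table itself.
table = freqs_table(max(max(row) for row in position_ids) + 1)
indexed = [[v for pos in row for v in table[pos]] for row in position_ids]

# The position_ids-style module does the equivalent work internally.
direct = freqs_from_positions(position_ids)
```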

mlcd - the caller prepends a learned class_pos_emb parameter to the raw freqs before applying cos/sin. To make forward() return the final embedding including the class token, class_pos_emb would need to move into the RoPE class, which requires a weight rename in the conversion script. Again, I can work on this if given the go-ahead.
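
A rough sketch of the mlcd call-site pattern described above, with lists in place of tensors. The values of `class_pos_emb` and `patch_freqs` are made up, and `finalize` is a hypothetical helper; the point is only that the class-token row enters before cos/sin is taken.

```python
import math


def finalize(freqs):
    # List analogue of torch.cat((freqs, freqs), dim=-1).cos() / .sin().
    emb = [row + row for row in freqs]
    cos = [[math.cos(v) for v in row] for row in emb]
    sin = [[math.sin(v) for v in row] for row in emb]
    return cos, sin


class_pos_emb = [[0.3, -0.7]]  # learned parameter in the real model; made-up values
patch_freqs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # made-up raw frequencies

# The caller prepends the class-token row to the raw freqs, then finalises.
cos, sin = finalize(class_pos_emb + patch_freqs)
```

The first row of `cos` and `sin` depends on the learned parameter, so a forward() that returns the complete (cos, sin) has to have access to it.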

llama4 - uses complex-domain RoPE (torch.view_as_complex), which is structured differently and does not follow the cat + cos/sin pattern at all.
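
As background, the sketch below shows the usual complex formulation of RoPE on a single channel pair, next to the explicit cos/sin rotation. It is not taken from the llama4 code, and both function names are hypothetical. The rotation is the same mathematically, but the complex form never materialises separate cos and sin values that a forward() could return.

```python
import cmath
import math


def rotate_complex(x_a, x_b, angle):
    # Treat the pair as one complex number and multiply by e^{i * angle}.
    z = complex(x_a, x_b) * cmath.exp(1j * angle)
    return z.real, z.imag


def rotate_cos_sin(x_a, x_b, angle):
    # The same 2D rotation written with explicit cos and sin terms.
    c, s = math.cos(angle), math.sin(angle)
    return x_a * c - x_b * s, x_a * s + x_b * c


via_complex = rotate_complex(1.0, 2.0, 0.5)
via_cos_sin = rotate_cos_sin(1.0, 2.0, 0.5)
```

The two results agree up to floating-point rounding.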

Secondary/derived models (exaone4_5, glm4v_moe, qwen3_5_moe, qwen3_vl_moe, qwen3_omni_moe) - these derive from the primary models updated in this PR. Their modular files have pass for the VisionRotaryEmbedding class and no override of VisionTransformer.forward.
The modular converter has a global name registry bug: qwen2_5_omni/modular_qwen2_5_omni.py defines class Qwen2_5_VisionRotaryEmbedding(Qwen2_5_VisionRotaryEmbedding), which pollutes the registry and causes derived classes in other models to pick up the seqlen: int forward instead of the position_ids one. Running the converter for these secondary models in the same batch produces inconsistent output, for example the new RoPE API in the class but the old cat + cos/sin pattern at the call site. Their generated files are currently self-consistent with the old API throughout, so I left them at HEAD.

make fix-repo - why I was not able to run it:

make fix-repo regenerated ALL modular models, reformatted every file, synced doc TOCs, updated docstrings, etc. Running it in full produced 200+ changed files across unrelated models. Most of these were harmless formatter noise, but in a few cases the converter introduced bugs.

Tests

The models changed in this PR don't have standalone test suites that run by default, and most vision model tests are marked slow. The change is mechanical and behaviour-preserving: the (cos, sin) output from forward() is identical to what the call site computed before, namely torch.cat((freqs, freqs), dim=-1).cos() / .sin(). That computation has simply moved inside the class.

I confirm that this is not a pure code agent PR.

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: ernie4_5_vl_moe, glm4v, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_vl, video_llama_3

@Rocketknight1
Member

Being handled internally I think!

2 participants