Vision RoPE refactor by srishtiii28 · Pull Request #46542 · huggingface/transformers

srishtiii28 · 2026-06-10T15:45:37Z

VisionRotaryEmbedding.forward() across vision-language models returned raw frequency tensors, leaving the caller responsible for finalising the positional embedding via torch.cat((freqs, freqs), dim=-1).cos() / .sin(). This PR makes forward() return (cos, sin) directly, which is consistent with how text RoPE modules already work.

Changes:

Base class (qwen2_vl): VisionRotaryEmbedding.forward(position_ids) now returns tuple[torch.Tensor, torch.Tensor]. The torch.cat + .cos() / .sin() logic is moved into the class.
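A minimal sketch of the new return contract, assuming position_ids holds one (row, col) pair per patch and that frequencies come from an inv_freq buffer as in typical RoPE modules. The class name VisionRotaryEmbeddingSketch and the tensor layout are illustrative assumptions, not the code from this PR:

```python
import torch
from torch import nn


class VisionRotaryEmbeddingSketch(nn.Module):
    """Hypothetical stand-in for illustration; not the transformers class."""

    def __init__(self, dim: int, theta: float = 10000.0) -> None:
        super().__init__()
        # One inverse frequency per pair of channels: dim / 2 values.
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float) / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    def forward(self, position_ids: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # position_ids: (seq_len, 2) integer (row, col) ids (assumed layout).
        freqs = position_ids.to(self.inv_freq.dtype)[..., None] * self.inv_freq
        freqs = freqs.flatten(1)  # (seq_len, dim)
        # The step that previously lived at the call site:
        emb = torch.cat((freqs, freqs), dim=-1)  # (seq_len, 2 * dim)
        return emb.cos(), emb.sin()


rope = VisionRotaryEmbeddingSketch(dim=8)
position_ids = torch.tensor([[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2]])
cos, sin = rope(position_ids)
print(cos.shape, sin.shape)  # torch.Size([6, 16]) torch.Size([6, 16])
```

With this shape of API the caller can pass (cos, sin) straight to the attention layers as position embeddings, with no extra torch.cat at the call site.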

Models updated:

qwen2_vl - base class
qwen2_5_vl - window_index is now applied to position_ids before the RoPE call, which is mathematically equivalent because RoPE is computed independently for each position row
qwen3_vl - also removed a no-op reshape(seq_len, -1) on an already 2D tensor
qwen3_5 - same no-op reshape removed
video_llama_3, ernie4_5_vl_moe, glm4v, glm_ocr, paddleocr_vl
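The equivalence claimed for qwen2_5_vl can be checked with a small sketch. It assumes window_index is a plain permutation of rows (the real model may reorder groups of rows rather than single rows), and rope_cos_sin is a hypothetical helper, not a function from the library:

```python
import torch

torch.manual_seed(0)
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, 8, 2, dtype=torch.float) / 8))


def rope_cos_sin(position_ids: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Hypothetical helper: (seq_len, 2) position ids -> (cos, sin)."""
    freqs = (position_ids.to(inv_freq.dtype)[..., None] * inv_freq).flatten(1)
    emb = torch.cat((freqs, freqs), dim=-1)
    return emb.cos(), emb.sin()


position_ids = torch.randint(0, 16, (12, 2))
window_index = torch.randperm(12)  # assumed: a row permutation

# Reorder after computing the embedding (old ordering of operations).
cos_after, sin_after = rope_cos_sin(position_ids)
cos_after, sin_after = cos_after[window_index], sin_after[window_index]

# Reorder the position ids first, then compute (ordering described above).
cos_before, sin_before = rope_cos_sin(position_ids[window_index])

print(torch.allclose(cos_after, cos_before), torch.allclose(sin_after, sin_before))
```

Each output row depends only on its own input row, so permuting rows before or after the computation gives the same result.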

What is left out at the moment:

qwen2_5_omni - its vision attention uses a different interface entirely. VisionRotaryEmbedding.forward(seqlen: int) returns raw freqs that are then indexed by position. Changing this requires restructuring the vision attention, not just the RoPE class. I can work on this if the reviewers give the go-ahead.

mlcd - the caller prepends a learned class_pos_emb parameter to the raw freqs before applying cos/sin. To make forward() return the final embedding including the class token, class_pos_emb would need to move into the RoPE class, which requires a weight rename in the conversion script. I can work on this as well if given the go-ahead.

llama4 - uses complex-domain RoPE (torch.view_as_complex), which is structured differently and does not follow the freqs, then cat, then cos/sin pattern at all.
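For contrast, a generic sketch of complex-domain rotation, not llama4's actual code. It rotates adjacent channel pairs by multiplying with e^{i*angle}, so there is no (cos, sin) tuple to return; the tensor shapes and random angles below are purely illustrative:

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 8)       # (seq_len, head_dim), head_dim even
angles = torch.randn(5, 4)  # one rotation angle per adjacent channel pair

# Complex formulation: view pairs (x0, x1), (x2, x3), ... as complex numbers.
x_complex = torch.view_as_complex(x.reshape(5, 4, 2))
rotation = torch.polar(torch.ones_like(angles), angles)  # e^{i * angle}
out_complex = torch.view_as_real(x_complex * rotation).flatten(1)

# The same rotation written with explicit cos / sin on interleaved pairs.
x_even, x_odd = x[:, 0::2], x[:, 1::2]
cos, sin = angles.cos(), angles.sin()
out_real = torch.stack(
    (x_even * cos - x_odd * sin, x_even * sin + x_odd * cos), dim=-1
).flatten(1)

print(torch.allclose(out_complex, out_real, atol=1e-5))
```

The two forms are numerically equivalent, but the complex version carries a single complex tensor through the model, which is why a (cos, sin) return type does not map onto it directly.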

Secondary/derived models (exaone4_5, glm4v_moe, qwen3_5_moe, qwen3_vl_moe, qwen3_omni_moe) - these derive from the primary models fixed in this PR. Their modular files have pass for the VisionRotaryEmbedding class and no override of VisionTransformer.forward.
The modular converter has a global name registry bug: qwen2_5_omni/modular_qwen2_5_omni.py defines class Qwen2_5_VisionRotaryEmbedding(Qwen2_5_VisionRotaryEmbedding), which pollutes the registry and causes derived classes in other models to pick up the seqlen: int forward instead of the position_ids one. Running the converter for these secondary models in the same batch produces inconsistent output, for example the new RoPE API in the class but the old cat + cos/sin pattern at the call site. Their generated files are currently self-consistent with the old API throughout, so I left them at HEAD.
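To make the two forward signatures mentioned above concrete, here is a hedged sketch of both. forward_seqlen and forward_position_ids are hypothetical function names, and the (row, col) layout of position_ids is an assumption:

```python
import torch

inv_freq = 1.0 / (10000.0 ** (torch.arange(0, 8, 2, dtype=torch.float) / 8))


def forward_seqlen(seqlen: int) -> torch.Tensor:
    """seqlen-style interface: a table of raw freqs, one row per integer position."""
    seq = torch.arange(seqlen, dtype=inv_freq.dtype)
    return torch.outer(seq, inv_freq)  # (seqlen, dim / 2)


def forward_position_ids(position_ids: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """position_ids-style interface: returns the finished (cos, sin) pair."""
    freqs = (position_ids.to(inv_freq.dtype)[..., None] * inv_freq).flatten(1)
    emb = torch.cat((freqs, freqs), dim=-1)
    return emb.cos(), emb.sin()


position_ids = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1], [2, 3]])

# seqlen-style call site: build the table, index it, then finish by hand.
table = forward_seqlen(int(position_ids.max()) + 1)
freqs = table[position_ids].flatten(1)
emb = torch.cat((freqs, freqs), dim=-1)
table_cos, table_sin = emb.cos(), emb.sin()

# position_ids-style call site: nothing left to do.
direct_cos, direct_sin = forward_position_ids(position_ids)

print(torch.allclose(table_cos, direct_cos), torch.allclose(table_sin, direct_sin))
```

The results match, but the call sites are not interchangeable: a class generated with one signature and a call site written for the other would fail or silently compute the wrong thing, which is the kind of inconsistency described above.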

make fix-repo - why I was not able to run it:

make fix-repo regenerates ALL modular models, reformats every file, syncs doc TOCs, updates docstrings, etc. Running it in full produced 200+ changed files across unrelated models, which were mostly harmless formatter noise but also included a few cases where the converter introduced bugs.

Tests

The models changed in this PR don't have standalone test suites that run by default, and most vision model tests are marked slow. The change is mechanical: the (cos, sin) output from forward() is identical to what the call site computed before, namely torch.cat((freqs, freqs), dim=-1).cos() / .sin(). That computation has simply moved inside the class.
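A fast, model-free check of that claim could look like the sketch below. old_forward, new_forward, and check_equivalence are hypothetical helpers written for illustration; they mirror the described behaviour rather than import the real modules:

```python
import torch


def make_inv_freq(dim: int, theta: float = 10000.0) -> torch.Tensor:
    return 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float) / dim))


def old_forward(position_ids: torch.Tensor, inv_freq: torch.Tensor) -> torch.Tensor:
    """Old contract (assumed): return raw freqs, caller finishes the embedding."""
    return (position_ids.to(inv_freq.dtype)[..., None] * inv_freq).flatten(1)


def new_forward(position_ids: torch.Tensor, inv_freq: torch.Tensor):
    """New contract: the cat + cos / sin step happens inside."""
    freqs = old_forward(position_ids, inv_freq)
    emb = torch.cat((freqs, freqs), dim=-1)
    return emb.cos(), emb.sin()


def check_equivalence(dim: int, seq_len: int, seed: int = 0) -> bool:
    gen = torch.Generator().manual_seed(seed)
    position_ids = torch.randint(0, 32, (seq_len, 2), generator=gen)
    inv_freq = make_inv_freq(dim)
    # What the call site used to compute from raw freqs.
    freqs = old_forward(position_ids, inv_freq)
    old_emb = torch.cat((freqs, freqs), dim=-1)
    new_cos, new_sin = new_forward(position_ids, inv_freq)
    torch.testing.assert_close(new_cos, old_emb.cos())
    torch.testing.assert_close(new_sin, old_emb.sin())
    return True


results = [check_equivalence(dim, seq_len) for dim in (8, 16, 40) for seq_len in (1, 7, 64)]
print(all(results))
```

A check like this runs in milliseconds on CPU and does not replace the slow model tests, which would still be needed to confirm that each updated call site consumes (cos, sin) correctly.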

I confirm that this is not a pure code agent PR.

github-actions · 2026-06-10T15:46:53Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: ernie4_5_vl_moe, glm4v, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_vl, video_llama_3

Rocketknight1 · 2026-06-11T10:58:03Z

Being handled internally I think!

srishtiii28 and others added 12 commits June 9, 2026 23:16

qwen2_vl: VisionRotaryEmbedding.forward returns (cos, sin)

02d51b2

qwen2_5_vl: VisionRotaryEmbedding.forward returns (cos, sin)

a6285bc

qwen3_vl: VisionRotaryEmbedding.forward returns (cos, sin)

3affb1b

qwen3_5: VisionRotaryEmbedding.forward returns (cos, sin)

67de666

glm4v: VisionRotaryEmbedding.forward returns (cos, sin)

e19529b

glm_ocr: VisionRotaryEmbedding.forward returns (cos, sin)

81effeb

Merge branch 'huggingface:main' into vision-rope-refactor

6d4fdcc

ernie4_5_vl_moe: VisionRotaryEmbedding.forward returns (cos, sin)

ac193a8

Merge branch 'huggingface:main' into vision-rope-refactor

2152765

video_llama_3: VisionRotaryEmbedding.forward returns (cos, sin)

1a798aa

paddleocr_vl: VisionRotaryEmbedding.forward returns (cos, sin)

593e22f

Merge branch 'huggingface:main' into vision-rope-refactor

1494efe

Rocketknight1 closed this Jun 11, 2026

srishtiii28 wants to merge 12 commits into huggingface:main from srishtiii28:vision-rope-refactor
