Skip to content

Conversation

Isotr0py
Copy link
Member

@Isotr0py Isotr0py commented Sep 24, 2025

Purpose

  • Qwen3-VL's max_num_video_tokens calculation is implemented wrong, this PR corrects it

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Isotr0py <[email protected]>
@mergify mergify bot added qwen Related to Qwen models v1 labels Sep 24, 2025
@Isotr0py Isotr0py marked this pull request as ready for review September 24, 2025 08:16
@ywang96
Copy link
Member

ywang96 commented Sep 24, 2025

Should we update this as well?

_MAX_FRAMES_PER_VIDEO = 600

@Isotr0py
Copy link
Member Author

So should we make it configurable or just increase the value for it?

@ywang96
Copy link
Member

ywang96 commented Sep 24, 2025

So should we make it configurable or just increase the value for it?

Actually - let's make a new get_num_frames_with_most_features inside Qwen3-VL for now. I checked for Qwen2 and 2.5VL this number will be 14, so we can change this back to 14 but make a new constant inside Qwen3-VL with its get_num_frames_with_most_features referring to it

@ywang96
Copy link
Member

ywang96 commented Sep 24, 2025

This PR fixes the issue of mismatch between calculated number of tokens and actual number of tokens generated from ViT, but I'm getting these warnings.
(Worker_TP5 pid=585480) Token indices sequence length is longer than the specified maximum sequence length for this model (414874 > 262144). Running this sequence through the model will result in indexing errors

I think there's something wrong around how we call the HF processor that introduces this warning, which could be confusing to the user.

@Isotr0py Isotr0py changed the title [Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for video profiling [Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for configurable video profiling Sep 25, 2025
@Isotr0py
Copy link
Member Author

This PR will be reworked for Qwen3-VL after #25631 merged.

@Isotr0py Isotr0py marked this pull request as draft September 25, 2025 08:04
@Isotr0py Isotr0py changed the title [Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for configurable video profiling [VLM] Update Qwen3-VL max_num_video_tokens calculation for configurable video profiling Sep 25, 2025
Copy link

mergify bot commented Sep 27, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Isotr0py.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 27, 2025
@ywang96 ywang96 added this to the v0.11.0 Cherry Picks milestone Sep 27, 2025
@mergify mergify bot removed the needs-rebase label Sep 27, 2025
@ywang96
Copy link
Member

ywang96 commented Sep 27, 2025

#25810 should solve the proble of the big misbatch between ViT output length and video soft token length, so I'm going to update this PR accordingly

Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
@ywang96
Copy link
Member

ywang96 commented Sep 28, 2025

I've update several logics:

  • We've now decoupled the calculation of max video token from number of embeddings of ViT output and now include padding tokens in between them.
  • Based on Qwen3-VL default setting, the max pixel is recommended at 24576 * 32 * 32, and therefore the max number of frames should be lowered to 24576, which translates to a total number of 148634 tokens. This is not far from our estimation of 12288 * 12.5 = 153600, and I think have a bit of overestimation is okay.

@ywang96 ywang96 marked this pull request as ready for review September 28, 2025 00:26
Comment on lines +755 to +760
target_video_size, _ = self.info._get_vision_info(
image_width=target_width,
image_height=target_height,
num_frames=target_num_frames,
image_processor=self.info.get_video_processor(),
)
Copy link
Member

@ywang96 ywang96 Sep 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in fact pretty important.

Previous we're sending a [32, 4096, 4096, 3] input tensor which would OOM if we turn on DP ViT, this is now corrected to [24576, 32, 32, 3]

Signed-off-by: Roger Wang <[email protected]>
@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 28, 2025
@ywang96 ywang96 enabled auto-merge (squash) September 28, 2025 04:04
@ywang96 ywang96 merged commit 0efd540 into vllm-project:main Sep 28, 2025
50 checks passed
simon-mo pushed a commit that referenced this pull request Sep 28, 2025
…le video profiling (#25557)

Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Signed-off-by: simon-mo <[email protected]>
@Isotr0py Isotr0py deleted the fix-qwen3-video-profiling branch September 28, 2025 07:57
baonudesifeizhai pushed a commit to baonudesifeizhai/vllm that referenced this pull request Sep 28, 2025
…le video profiling (vllm-project#25557)

Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Signed-off-by: baonudesifeizhai <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed v1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants