feat: Custom masking utils for Gemma3 VLM #5853
Conversation
/bot run
PR_Github #11341 [ run ] triggered by Bot
RE: image tokens can appear anywhere in the input_ids

Thank you, @schetlur-nv! I've noticed that where the image tokens start can vary with the length of the system prompt (more text tokens before the image tokens for a longer system prompt) and with whether a single image or multiple images are used (the second image's tokens start after the first one's). But it's true that the number 262144 is
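To make the "image tokens can appear anywhere" point concrete, here is a small standalone sketch (not code from this PR) that locates the contiguous image-token blobs regardless of where they start, assuming 262144 is the image placeholder id discussed in this thread:

```python
IMAGE_TOKEN_ID = 262144  # Gemma3 image placeholder id mentioned in this thread

def image_token_spans(input_ids):
    """Return [start, end) index pairs for each contiguous image-token blob."""
    spans, start = [], None
    for i, tok in enumerate(input_ids):
        if tok == IMAGE_TOKEN_ID and start is None:
            start = i  # blob begins
        elif tok != IMAGE_TOKEN_ID and start is not None:
            spans.append((start, i))  # blob ended at previous token
            start = None
    if start is not None:  # blob runs to the end of the sequence
        spans.append((start, len(input_ids)))
    return spans

# A longer system prompt just shifts the spans right; two images yield two spans.
ids = [2, 3, 4, 262144, 262144, 100, 262144, 262144, 262144]
print(image_token_spans(ids))  # → [(3, 5), (6, 9)]
```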
I think chunked prefill may work if we respect the image token boundaries, that is, each multimodal item (e.g., an image) must occupy a single chunk. @brb-nv Regarding your comment on KV cache reuse: currently, KV cache reuse is not supported for multimodal models in the PyTorch flow (see here). I have a pending PR #5444 to enable it initially. Also, I think your assumption that either all tokens of an image must be reused or none at all might not always hold. For example, consider the input:

Now, if another sequence shares the same image but ends slightly differently, e.g., [1, 2, image_token, image_token, image_token, image_token, 11, 13], we can only reuse Block 1, which only partially covers the image tokens.
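To illustrate the partial-coverage scenario above, here is a minimal sketch (not the actual TRT-LLM block manager; the block size and helper names are made up) showing how a fixed block size can leave the shared prefix covering only part of an image blob:

```python
IMG = -1  # hypothetical stand-in for the image placeholder token id

def shared_prefix_len(a, b):
    """Length of the common token prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def fully_reusable_blocks(a, b, block_size):
    """Number of complete KV cache blocks covered by the shared prefix."""
    return shared_prefix_len(a, b) // block_size

# Two sequences sharing the same 4-token image but diverging at the last token.
seq1 = [1, 2, IMG, IMG, IMG, IMG, 11, 12]
seq2 = [1, 2, IMG, IMG, IMG, IMG, 11, 13]

# Shared prefix is 7 tokens; with block_size=4 only one block is fully shared,
# and that block covers just 2 of the 4 image tokens -> the image blob is split.
print(fully_reusable_blocks(seq1, seq2, block_size=4))  # → 1
```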
Thank you, @chang-l! The example you have is exactly why I was saying
PR_Github #11341 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #11411 [ run ] triggered by Bot
PR_Github #11411 [ run ] completed with state
/bot run
PR_Github #11448 [ run ] triggered by Bot
/bot run --disable-fail-fast
PR_Github #11453 [ run ] triggered by Bot
PR_Github #11448 [ run ] completed with state
PR_Github #11453 [ run ] completed with state
Overall LGTM, great work.
/bot reuse-pipeline
Reusing pipeline because the changes are cosmetic; on f4bae6e I reran formatting too.
PR_Github #11468 [ reuse-pipeline ] triggered by Bot
Signed-off-by: Balaram Buddharaju <[email protected]>
/bot reuse
GitHub Bot Help

Provide a user friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.

run
Launch build/test pipelines. All previously running jobs will be killed.

kill
Kill all running builds associated with pull request.

skip
Skip testing for latest commit on pull request.

reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
/bot reuse-pipeline
PR_Github #11469 [ reuse-pipeline ] triggered by Bot
PR_Github #11468 [ reuse-pipeline ] completed with state
/bot reuse-pipeline
PR_Github #11470 [ reuse-pipeline ] triggered by Bot
PR_Github #11469 [ reuse-pipeline ] completed with state
PR_Github #11470 [ reuse-pipeline ] completed with state
Signed-off-by: Balaram Buddharaju <[email protected]>
Signed-off-by: Yuxin <[email protected]>
Description

This MR introduces custom masking utils for Gemma3 VLM and custom_mask usage in the FlashInfer backend.

Background about the custom mask:
- get_flashinfer_attention_mask will only be called for a batch when there's at least one context request in the batch with image tokens.
- input_ids may have a mix of image (image_token_idx) and text tokens, where the tokens corresponding to an image appear as a contiguous blob.
  Example: torch.IntTensor([2, 3, 4, 5, img_idx, img_idx, img_idx, ..., img_idx, 100])

This transformers PR has a nice visualization of the attention mask for global attention and sliding window attention:
huggingface/transformers#38295
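As a rough illustration of what such a mask looks like (a minimal sketch under stated assumptions, not the PR's actual get_flashinfer_attention_mask implementation, which also handles batching and sliding-window attention): text tokens attend causally, while tokens inside the same contiguous image blob attend to each other bidirectionally.

```python
import torch

def make_gemma3_style_mask(input_ids: torch.Tensor, image_token_id: int) -> torch.Tensor:
    """Boolean [seq, seq] mask: entry (i, j) is True if position i may attend to j.

    Sketch only: causal attention for text, plus bidirectional attention
    within each contiguous blob of image tokens.
    """
    seq = input_ids.shape[0]
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
    is_img = input_ids == image_token_id
    # A new blob starts wherever is_img flips from False to True.
    starts = is_img & ~torch.cat([torch.tensor([False]), is_img[:-1]])
    # Label each image token with its blob index (0 marks text tokens).
    blob_id = torch.cumsum(starts.int(), dim=0) * is_img.int()
    # Tokens in the same blob see each other in both directions.
    same_blob = (blob_id[:, None] == blob_id[None, :]) & is_img[:, None] & is_img[None, :]
    return mask | same_blob

ids = torch.tensor([2, 3, 9, 9, 9, 100])  # 9 stands in for the image token id
m = make_gemma3_style_mask(ids, image_token_id=9)
# The image token at position 2 can now attend "forward" to positions 3 and 4,
# while the text tokens remain strictly causal.
```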
Request for reviewers:

I'd appreciate your comments on the two strong assumptions made here: chunked prefill and KV cache reuse are disabled to get the bidirectional masking right. My thoughts: image tokens can appear anywhere in input_ids, and bidirectionality will be lost if chunking breaks an image token blob into separate chunks.

Test Coverage
The following tests validate the masking utils.
The following tests validate the masking utils as well as custom mask usage by the FlashInfer backend.
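Regarding the chunked-prefill concern raised in the description above, the constraint can be sketched as follows: chunk boundaries must be moved so that no image blob is ever split across chunks. This is a hypothetical illustration, not part of this PR:

```python
def chunk_boundaries(input_ids, image_token_id, chunk_size):
    """Yield chunk end indices, extending a chunk so it never splits an image blob."""
    n = len(input_ids)
    end = 0
    while end < n:
        end = min(end + chunk_size, n)
        # If the boundary lands inside an image blob (image tokens on both
        # sides of it), push the boundary forward to the end of the blob.
        while end < n and input_ids[end] == image_token_id and input_ids[end - 1] == image_token_id:
            end += 1
        yield end

IMG = 262144  # image placeholder id mentioned in this thread
ids = [1, 2, IMG, IMG, IMG, IMG, 7, 8]
# A naive chunk size of 3 would cut the blob at index 3; the boundary is
# pushed to index 6 so the whole image stays in one chunk.
print(list(chunk_boundaries(ids, IMG, chunk_size=3)))  # → [6, 8]
```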
GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

Kill all running builds associated with pull request.

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.