BLIPs clean-up #35560

Merged: 18 commits merged into huggingface:main on Jul 29, 2025
Conversation

@zucchini-nlp (Member) commented Jan 8, 2025

What does this PR do?

Continuation of #34502, which I separated out because BLIP wasn't ready to be merged in the v4.48 release. The main reason is the Blip2Retrieval model, whose hub configs weren't updated.

Main changes:

  • Clean up the if/else checks and leave only the new logic with expanded inputs.
  • Change the Blip2Retrieval model class to accept expanded inputs (open to feedback). Currently it expects non-expanded inputs for ITM, so we just crop the inputs, removing the extra image placeholders.
  • Opened PRs on the Hub for the two Blip2Retrieval checkpoints; will merge them right after this PR.
  • The tokenizer's "max_length" now includes the image tokens, which is breaking but in line with other VLMs (see the sketch after this list). I would like to keep the change even if breaking, to be consistent, or should we wait for v5.0?
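
For illustration, a minimal sketch of what the "max_length" change means in practice; the checkpoint name, image URL, and token count are illustrative, and the processor kwargs are assumed to be forwarded to the tokenizer as in other VLM processors:

import requests
from PIL import Image
from transformers import Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Before this PR, `max_length` bounded only the text tokens, so the returned
# sequence could exceed it once the image placeholders were expanded.
# After this PR, `max_length` bounds the full sequence, image tokens included.
inputs = processor(
    images=image,
    text="Question: what is shown? Answer:",
    max_length=64,
    truncation=True,
    return_tensors="pt",
)
assert inputs["input_ids"].shape[-1] <= 64  # holds with image tokens counted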

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment

Down to have this still, but up to you; it's been a while, sorry for my late review.

@zucchini-nlp (Member, Author)

+1, we should have this so we can remove the deprecated code. I will resolve conflicts and see if anything else needs to be done.

@zucchini-nlp (Member, Author)

run-slow: blip_2, instructblip, instructblipvideo

Contributor

This comment contains run-slow, running the specified jobs:

models: ['models/blip_2', 'models/instructblip', 'models/instructblipvideo']
quantizations: [] ...

Comment on lines +324 to +329
# Add imports via `define_import_structure` after the #35167 as we remove explicit import in `__init__.py`
from transformers.utils.import_utils import define_import_structure

reversed_structure = {}
new_imported_modules_from_import_structure = define_import_structure("src/transformers/__init__.py")
for mapping in new_imported_modules_from_import_structure.values():
@zucchini-nlp (Member, Author)

I don't know why this fails only a few months after the __init__ refactor; probably no one updated tiny_model_summary.json since then.

Without it, the script fails to fetch tests that are modified in tiny_model_summary.json.
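
For context, a hedged guess at how the truncated loop above might continue; the nested-mapping shape ({backend requirements: {module path: set of exported names}}) is an assumption about what define_import_structure returns, not a verbatim copy of the script:

for mapping in new_imported_modules_from_import_structure.values():
    for module_path, exported_names in mapping.items():
        for name in exported_names:
            # Reverse lookup (exported name -> defining module) so the tests
            # fetcher can map model classes listed in tiny_model_summary.json
            # back to the files whose tests should run.
            reversed_structure[name] = module_path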

Collaborator

cc @ydshieh !

@ydshieh (Collaborator)

Yeah, the tiny model creation script is failing (on GitHub), and it has been some time since anyone (i.e. me) updated the relevant stuff in the codebase.

Will try to work on that.

@zucchini-nlp (Member, Author)

run-slow: blip_2, instructblip, instructblipvideo

Contributor

This comment contains run-slow, running the specified jobs:

models: ['models/blip_2', 'models/instructblip', 'models/instructblipvideo']
quantizations: [] ...

Comment on lines +1564 to +1576
if input_ids is None:
special_image_mask = inputs_embeds == self.get_input_embeddings()(
torch.tensor(self.config.image_token_id, dtype=torch.long, device=inputs_embeds.device)
)
special_image_mask = special_image_mask.all(-1)
else:
special_image_mask = input_ids == self.config.image_token_id

special_image_mask = special_image_mask.unsqueeze(-1).expand_as(inputs_embeds).to(language_model_inputs.device)
language_model_inputs = language_model_inputs.to(inputs_embeds.device, inputs_embeds.dtype)
inputs_embeds = inputs_embeds.to(language_model_inputs.device).masked_scatter(
special_image_mask, language_model_inputs
)
Collaborator

should we have a small function for this?

@zucchini-nlp (Member, Author)

Yeah, we can isolate the special_image_mask computation step in a tiny function; it is a recurring pattern in all VLMs after the last big PR I merged for supporting inputs embeds.

Taking it as a note for a subsequent PR.
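
One possible shape for that tiny function, distilled from the snippet quoted above; the name get_image_mask and its placement as a model method are assumptions, not the merged implementation:

import torch

def get_image_mask(self, input_ids, inputs_embeds):
    """Return a mask over inputs_embeds marking image placeholder positions."""
    if input_ids is None:
        # No token ids available: locate placeholders by comparing each
        # position's embedding against the embedding of the image token id.
        image_embed = self.get_input_embeddings()(
            torch.tensor(self.config.image_token_id, dtype=torch.long, device=inputs_embeds.device)
        )
        special_image_mask = (inputs_embeds == image_embed).all(-1)
    else:
        special_image_mask = input_ids == self.config.image_token_id
    return special_image_mask.unsqueeze(-1).expand_as(inputs_embeds)

Callers would then keep the masked_scatter step from the snippet, e.g. inputs_embeds.masked_scatter(mask, language_model_inputs).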

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: blip_2, instructblip, instructblipvideo

@ArthurZucker (Collaborator) left a comment

LGTM otherwise, but happy if we isolate this in a function along with this PR!

@zucchini-nlp (Member, Author)

Let's merge it then. I will make a PR to isolate get_image_mask into a tiny separate function.

@zucchini-nlp zucchini-nlp merged commit 7579479 into huggingface:main Jul 29, 2025
20 checks passed
@@ -626,7 +626,7 @@
"model_classes": [
"Blip2ForConditionalGeneration"
],
"sha": "35e1ef43da3554af62eb29a7b3dbbef3f3bef48e"
"sha": "d0de11fd1f8ca481231c07ee0934924be96cb281"
Collaborator

oops ....
