Fix regression in ProcessorMixin._load_tokenizer_from_pretrained for tokenizers at root #46592

Open
punyamodi wants to merge 1 commit into
huggingface:mainfrom
punyamodi:fix-tokenizer-subfolder-fallback

@punyamodi

Summary

transformers v5.x changed how additional (non-primary) tokenizers are loaded: each one is now looked up in a subfolder named after its sub-processor attribute (e.g. files are searched for in bpe_tokenizer/ when the attribute name is bpe_tokenizer).

This change is a regression for existing model repositories that keep their tokenizer files at the root of the repository while the sub-processor attribute has a different name (for example, the UniversalActionProcessor of physical-intelligence/fast, which uses bpe_tokenizer). Loading such a processor with AutoProcessor.from_pretrained() fails with a ValueError because no tokenizer files can be found in the subfolder.

This PR wraps the subfolder loading in a try/except block. If loading from the subfolder fails, it logs a deprecation warning and falls back to loading from the root of the repository (or from the subfolder passed by the caller, if one was given).
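The fallback described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in the diff: `FAKE_REPO`, `fake_from_pretrained`, and `load_additional_tokenizer` are hypothetical names, the repository is modeled as a plain dict, and the caught exception types are an assumption (the summary only mentions a ValueError).

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-in for a model repository: maps a subfolder name
# ("" means the repository root) to the tokenizer files stored there.
FAKE_REPO = {"": ["tokenizer.json"]}


def fake_from_pretrained(repo, subfolder=""):
    """Stand-in for a tokenizer's from_pretrained; not the real transformers API."""
    if subfolder not in repo:
        raise ValueError(f"No tokenizer files found in subfolder {subfolder!r}")
    return {"loaded_from": subfolder, "files": repo[subfolder]}


def load_additional_tokenizer(repo, attribute_name, subfolder=""):
    """Try the attribute-named subfolder first, then fall back to `subfolder`."""
    # Subfolder named after the sub-processor attribute, e.g. "bpe_tokenizer".
    tokenizer_subfolder = f"{subfolder}/{attribute_name}" if subfolder else attribute_name
    try:
        return fake_from_pretrained(repo, subfolder=tokenizer_subfolder)
    except (ValueError, OSError):  # assumed exception types
        logger.warning(
            "Could not load tokenizer from subfolder %r; falling back to %r. "
            "This fallback is deprecated.",
            tokenizer_subfolder,
            subfolder or "<repository root>",
        )
        return fake_from_pretrained(repo, subfolder=subfolder)


# Legacy layout: files at the root, attribute named "bpe_tokenizer".
legacy = load_additional_tokenizer(FAKE_REPO, "bpe_tokenizer")
```

With the legacy layout the first attempt raises, the warning is logged, and the second attempt loads from the root. A repository that does contain a bpe_tokenizer/ subfolder never reaches the fallback, so the newer layout keeps its current behavior.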

Testing

Verified by loading physical-intelligence/fast with AutoProcessor.from_pretrained("physical-intelligence/fast", trust_remote_code=True) and confirming that it falls back to the repository root and loads the processor.

Copilot AI review requested due to automatic review settings June 12, 2026 08:32

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a fallback path for loading an “additional tokenizer” when the expected subfolder load fails, while warning that this fallback behavior is deprecated.

Changes:

  • Wrap from_pretrained(..., subfolder=tokenizer_subfolder) in a try/except and fall back to subfolder=subfolder on failure
  • Emit a warning message indicating the fallback is deprecated

Comment thread src/transformers/processing_utils.py Outdated
Comment thread src/transformers/processing_utils.py
@punyamodi punyamodi force-pushed the fix-tokenizer-subfolder-fallback branch from 7a4a264 to 39ffe6d Compare June 12, 2026 08:43
@punyamodi punyamodi force-pushed the fix-tokenizer-subfolder-fallback branch from 39ffe6d to 8e3e062 Compare June 12, 2026 08:49
@github-actions

Contributor

CI Dashboard: View test results in Grafana

@Rocketknight1

Member

cc @ArthurZucker @itazap

3 participants