Fix Wav2Vec2 word delimiter special token handling #46578

Open
nightcityblade wants to merge 1 commit into huggingface:main from nightcityblade:fix/issue-46552

Conversation

@nightcityblade

Contributor

Fixes #46552.

This keeps the Wav2Vec2 CTC word delimiter from being treated as a removable special token by special-token-aware APIs, matching the behavior of the decode path.

Changes:
- Preserve the word delimiter in convert_ids_to_tokens(..., skip_special_tokens=True).
- Exclude the word delimiter from get_special_tokens_mask(..., already_has_special_tokens=True).
- Add a regression test for delimiter masking/skipping.

Tests:
- python -m pytest tests/models/wav2vec2/test_tokenization_wav2vec2.py -q
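For reviewers who want the intended behavior at a glance, here is a minimal pure-Python sketch of the two changes listed above. It is a toy model, not the code in this PR or in transformers: `ToyCTCTokenizer`, its vocabulary, and the helper `_removable_special_ids` are hypothetical names made up for illustration, and it assumes the delimiter `|` has been registered alongside the other special tokens, which is the situation the linked issue describes.

```python
# Toy model of the intended behavior; NOT the transformers implementation.
# ToyCTCTokenizer, its vocab, and _removable_special_ids are hypothetical.


class ToyCTCTokenizer:
    def __init__(self, vocab, special_tokens, word_delimiter_token="|"):
        self.vocab = dict(vocab)
        self.ids_to_tokens = {i: t for t, i in self.vocab.items()}
        self.word_delimiter_token = word_delimiter_token
        # Assumption: the delimiter was registered as a special token.
        self.special_tokens = set(special_tokens)

    def _removable_special_ids(self):
        # Special ids that may be masked or skipped: every special token
        # except the word delimiter, which carries word-boundary information.
        return {
            self.vocab[t]
            for t in self.special_tokens
            if t != self.word_delimiter_token
        }

    def get_special_tokens_mask(self, ids):
        # 1 marks a removable special token, 0 marks everything else.
        removable = self._removable_special_ids()
        return [1 if i in removable else 0 for i in ids]

    def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
        # When skipping, drop only removable specials; keep the delimiter.
        removable = self._removable_special_ids() if skip_special_tokens else set()
        return [self.ids_to_tokens[i] for i in ids if i not in removable]


vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4, "H": 5, "I": 6}
tok = ToyCTCTokenizer(vocab, special_tokens=["<pad>", "<s>", "</s>", "<unk>", "|"])

ids = [1, 5, 6, 4, 5, 6, 2]  # <s> H I | H I </s>
mask = tok.get_special_tokens_mask(ids)
tokens = tok.convert_ids_to_tokens(ids, skip_special_tokens=True)
print(mask)    # [1, 0, 0, 0, 0, 0, 1]
print(tokens)  # ['H', 'I', '|', 'H', 'I']
```

The point of the sketch is that both APIs share one notion of "removable" special tokens, so the delimiter is neither masked nor skipped and word boundaries survive.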

@github-actions

Contributor

CI Dashboard: View test results in Grafana

@github-actions

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: wav2vec2

Development

Successfully merging this pull request may close these issues.

Wav2Vec2CTCTokenizer in v5 treats the word delimiter as a special token, which leaks into get_special_tokens_mask and convert_ids_to_tokens
