AssertionError: a != <star> during forced alignment on Hebrew transcription (ctc_forced_aligner crash) #375

@yaakovEntin

Description

When running diarize.py with a Hebrew Whisper model (ivrit-ai/whisper-large-v3-turbo-ct2, a fine-tuned Whisper model in CTranslate2 format), the transcription phase finishes, but the pipeline crashes during the forced alignment phase inside ctc_forced_aligner.

It throws AssertionError: a != <star> in alignment_utils.py while trying to map the generated tokens/characters to the timestamp segments. Going by the assertion's message format, the aligned segment is labeled a where the token sequence expects <star>. This seems to be caused by a mismatch between Hebrew character tokenization and the expected special tokens (like <star>) in the aligner.

The error occurs during the get_spans execution phase.

Error Logs/Stderr

/work/whisper-diarization/venv/lib/python3.12/site-packages/torchaudio/__init__.py:178: UserWarning: The 'encoding' parameter is not fully supported by TorchCodec AudioEncoder.
  return save_with_torchcodec(
/work/whisper-diarization/venv/lib/python3.12/site-packages/torchaudio/__init__.py:178: UserWarning: The 'bits_per_sample' parameter is not directly supported by TorchCodec AudioEncoder.
  return save_with_torchcodec(
Traceback (most recent call last):
  File "/work/whisper-diarization/diarize.py", line 183, in <module>
    spans = get_spans(tokens_starred, segments, blank_token)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/whisper-diarization/venv/lib/python3.12/site-packages/ctc_forced_aligner/alignment_utils.py", line 62, in get_spans
    assert seg.label == ltr, f"{seg.label} != {ltr}"
           ^^^^^^^^^^^^^^^^
AssertionError: a != <star>
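
To make the failure easier to reason about, here is a minimal sketch of the kind of label check that the assertion in the traceback performs. It is a simplified stand-in, not the actual code of ctc_forced_aligner: the `Segment` dataclass, the `first_label_mismatch` helper, the `<blank>` token string, and the toy data are all hypothetical, and it assumes one label per token.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for an aligned segment: a label plus frame bounds."""
    label: str
    start: int
    end: int


def first_label_mismatch(tokens, segments, blank="<blank>"):
    """Return (index, segment_label, token) for the first disagreement, or None.

    Blank segments are skipped; every remaining segment is expected to carry
    the same label as the token at the same position.
    """
    labelled = [seg for seg in segments if seg.label != blank]
    for idx, (seg, tok) in enumerate(zip(labelled, tokens)):
        if seg.label != tok:
            return idx, seg.label, tok
    return None


# Toy data (assumed): the token list starts with <star>, the segments with "a".
tokens = ["<star>", "a", "b"]
segments = [Segment("<blank>", 0, 1), Segment("a", 1, 3), Segment("b", 3, 5)]
mismatch = first_label_mismatch(tokens, segments)
print(mismatch)  # (0, 'a', '<star>')
```

Running a check like this on the real `tokens_starred` and `segments` just before the `get_spans` call might show at which position the two sequences first diverge.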

Environment:

  • OS: Ubuntu 24.04 / Linux
  • Python Version: 3.12
  • faster-whisper Version: 1.2.1
  • ctranslate2 Version: 4.7.2
  • Model Used: ivrit-ai/whisper-large-v3-turbo-ct2 (Hebrew)

Expected behavior
The alignment engine should gracefully handle non-Latin characters or ignore unknown structural tokens instead of throwing a hard assertion error, allowing the speaker diarization phase to complete.
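
As a rough illustration of what "graceful" could look like on the caller side, the sketch below wraps a span function and reports the mismatch instead of aborting. This is an assumption about one possible workaround, not a fix inside the aligner: `spans_or_none` and `always_fails` are hypothetical names, and the stand-in function only mimics the assertion seen in the traceback.

```python
def spans_or_none(span_fn, tokens, segments, blank):
    """Run span_fn; return its spans, or None if its label assertion fails."""
    try:
        return span_fn(tokens, segments, blank)
    except AssertionError as exc:
        # Report the mismatch instead of aborting the whole pipeline.
        print(f"forced alignment skipped: {exc}")
        return None


def always_fails(tokens, segments, blank):
    """Hypothetical stand-in that mimics the failing assertion in the traceback."""
    raise AssertionError("a != <star>")


result = spans_or_none(always_fails, ["<star>"], [], "<blank>")
```

In diarize.py the span function passed in would presumably be `get_spans`, and the caller would have to decide what to do when `None` comes back, for example continuing with coarser segment-level timestamps. Whether that fallback is acceptable for diarization quality is an open question.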
