`AssertionError: a != <star>` during forced alignment on Hebrew transcription (`ctc_forced_aligner` crash)

When running `diarize.py` with a Hebrew Whisper model (`ivrit-ai/whisper-large-v3-turbo-ct2`, a fine-tuned Whisper model in CTranslate2 format), the transcription phase finishes, but the pipeline crashes during the forced alignment phase inside `ctc_forced_aligner`.

`get_spans` in `alignment_utils.py` raises `AssertionError: a != <star>` while mapping the generated tokens/characters to the timestamp segments. Going by the assertion's format string (`f"{seg.label} != {ltr}"`), the aligned segment carries the label `a` while the token sequence expects `<star>` at that position. This seems to be caused by a mismatch between Hebrew character tokenization and the special tokens (like `<star>`) that the aligner expects.

**Error Logs/Stderr**
```text
/work/whisper-diarization/venv/lib/python3.12/site-packages/torchaudio/__init__.py:178: UserWarning: The 'encoding' parameter is not fully supported by TorchCodec AudioEncoder.
  return save_with_torchcodec(
/work/whisper-diarization/venv/lib/python3.12/site-packages/torchaudio/__init__.py:178: UserWarning: The 'bits_per_sample' parameter is not directly supported by TorchCodec AudioEncoder.
  return save_with_torchcodec(
Traceback (most recent call last):
  File "/work/whisper-diarization/diarize.py", line 183, in <module>
    spans = get_spans(tokens_starred, segments, blank_token)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/whisper-diarization/venv/lib/python3.12/site-packages/ctc_forced_aligner/alignment_utils.py", line 62, in get_spans
    assert seg.label == ltr, f"{seg.label} != {ltr}"
           ^^^^^^^^^^^^^^^^
AssertionError: a != <star>
```
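To see where the labels diverge without crashing, the failing check can be mirrored outside the library. The sketch below is a stand-in written for this report, not `ctc_forced_aligner` code: `Segment` and `first_label_mismatch` are hypothetical names, the `"<blank>"` default and the space-separated token format are assumptions, and only the `seg.label` attribute comes from the traceback.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Segment:
    # Hypothetical stand-in: the traceback only confirms a `label` attribute.
    label: str


def first_label_mismatch(
    tokens: Sequence[str],
    segments: Sequence[Segment],
    blank: str = "<blank>",  # assumed blank symbol; pass the real one
) -> Optional[Tuple[int, str, Optional[str]]]:
    """Return (segment index, segment label, expected label) for the first
    non-blank segment that disagrees with the token sequence, else None."""
    # Assumption: each token is a space-separated string of labels.
    expected = [label for tok in tokens for label in tok.split(" ") if label]
    pos = 0
    for idx, seg in enumerate(segments):
        if seg.label == blank:
            continue  # blank segments carry no label to compare
        want = expected[pos] if pos < len(expected) else None
        if seg.label != want:
            return idx, seg.label, want
        pos += 1
    return None


# Toy data shaped like the reported failure: the segment says "a"
# where the token list expects "<star>".
mismatch = first_label_mismatch(["<star>", "a"], [Segment("a")])
print(mismatch)  # (0, 'a', '<star>')
```

Running something like this against the real `tokens_starred` and `segments` just before the `get_spans` call may show whether the first segment is already out of step or whether the drift starts later in the transcript.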

**Environment:**
* **OS:** Ubuntu 24.04 / Linux
* **Python Version:** 3.12
* **`faster-whisper` Version:** 1.2.1
* **`ctranslate2` Version:** 4.7.2
* **Model Used:** `ivrit-ai/whisper-large-v3-turbo-ct2` (Hebrew)

**Expected behavior**
The alignment engine should gracefully handle non-Latin characters or ignore unknown structural tokens instead of throwing a hard assertion error, allowing the speaker diarization phase to complete.
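One possible shape for that graceful handling, as a caller-side workaround rather than a fix in the aligner, is to catch the assertion and continue without spans. This is a minimal sketch under assumptions: `spans_or_none` is a hypothetical helper, the logger name is made up, and the downstream diarization code would still need to cope with a `None` result.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("diarize")  # hypothetical logger name


def spans_or_none(get_spans_fn: Callable[..., Any], *args: Any) -> Optional[Any]:
    """Call the span-building function; return None instead of raising
    when its internal label assertion fails."""
    try:
        return get_spans_fn(*args)
    except AssertionError as exc:
        # The message carries the "<segment label> != <expected label>" pair.
        logger.warning("Forced alignment failed (%s); continuing without spans.", exc)
        return None


# Intended use in diarize.py, with the names from the traceback:
#   spans = spans_or_none(get_spans, tokens_starred, segments, blank_token)
```

Returning `None` keeps the failure visible in the logs while letting the caller decide whether to skip word-level timestamps or abort.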
