Feature Request
whisper-diarization is excellent for ASR + speaker diarization. Suggesting SenseVoice / Paraformer as alternative ASR backends — they're much faster and FunASR includes its own diarization model.
Why SenseVoice/Paraformer?
- Roughly 5-10x faster than Whisper: decoding is non-autoregressive, which cuts processing time substantially for long recordings
- Built-in speaker diarization: FunASR includes cam++ (7.2M params) for speaker embedding, so no separate pyannote step is needed
- Built-in VAD: FSMN-VAD with accurate timestamps
- Built-in punctuation: automatic punctuation restoration
- 50+ languages: the SenseVoice family targets multilingual content (the SenseVoiceSmall checkpoint used below is documented for a smaller language set, so coverage should be checked per language)
Complete pipeline comparison
Current: Whisper + pyannote → post-processing alignment
Alternative: FunASR (SenseVoice + FSMN-VAD + cam++ + CT-Punc) → all-in-one
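    from funasr import AutoModel

    # One AutoModel call wires ASR, VAD, speaker embedding, and punctuation together.
    model = AutoModel(
        model="iic/SenseVoiceSmall",  # ASR
        vad_model="fsmn-vad",         # voice activity detection
        spk_model="cam++",            # speaker embeddings for diarization
        punc_model="ct-punc",         # punctuation restoration
    )

    result = model.generate(input="meeting.wav")
    # Expected to return: text + timestamps + speaker labels + punctuation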
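To make the "speaker labels" part concrete, here is a minimal sketch of turning per-sentence output into speaker turns. It assumes each sentence arrives as a dict with `spk`, `start`, `end` (milliseconds) and `text` keys; that layout is an assumption and should be checked against the installed FunASR version. `merge_speaker_turns`, `format_turn`, and the sample data are hypothetical names used only for illustration.

```python
from typing import Dict, List


def merge_speaker_turns(sentences: List[Dict]) -> List[Dict]:
    """Merge consecutive sentences from the same speaker into one turn."""
    turns: List[Dict] = []
    for sent in sentences:
        if turns and turns[-1]["spk"] == sent["spk"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = sent["end"]
            turns[-1]["text"] += " " + sent["text"]
        else:
            # Speaker changed (or first sentence): start a new turn.
            # dict(sent) copies the entry so the input is not mutated.
            turns.append(dict(sent))
    return turns


def format_turn(turn: Dict) -> str:
    """Render one turn as '[start-end s] Speaker N: text'."""
    start_s = turn["start"] / 1000.0
    end_s = turn["end"] / 1000.0
    return f"[{start_s:.2f}-{end_s:.2f}s] Speaker {turn['spk']}: {turn['text']}"


# Hypothetical sample in the assumed layout (times in milliseconds).
sample = [
    {"spk": 0, "start": 0, "end": 1500, "text": "Hello everyone."},
    {"spk": 0, "start": 1600, "end": 3000, "text": "Let's begin."},
    {"spk": 1, "start": 3200, "end": 4800, "text": "Sounds good."},
]

for turn in merge_speaker_turns(sample):
    print(format_turn(turn))
```

Keeping this step separate from the model call means the same formatting could be reused whichever backend produced the sentences.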
Speed benefit for long recordings
For a 1-hour recording:
- Whisper large-v3: ~10-15 minutes processing
- SenseVoice: ~2-3 minutes processing (roughly 5x faster)
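The arithmetic behind the comparison can be checked with a short sketch. It takes the processing-time ranges quoted above as given (they are the estimates from this request, not measurements) and derives the real-time factor and the range of speedups they imply. `real_time_factor` and `speedup_range` are hypothetical helper names.

```python
AUDIO_MINUTES = 60.0  # the 1-hour recording from the comparison

# Processing-time ranges quoted above, in minutes (low, high).
whisper_minutes = (10.0, 15.0)
sensevoice_minutes = (2.0, 3.0)


def real_time_factor(processing_minutes: float,
                     audio_minutes: float = AUDIO_MINUTES) -> float:
    """Processing time divided by audio duration; lower is faster."""
    return processing_minutes / audio_minutes


def speedup_range(slow, fast):
    """Smallest and largest speedup implied by two (low, high) time ranges."""
    smallest = slow[0] / fast[1]  # quickest slow run vs slowest fast run
    largest = slow[1] / fast[0]   # slowest slow run vs quickest fast run
    return smallest, largest


smallest, largest = speedup_range(whisper_minutes, sensevoice_minutes)
print(f"Whisper RTF: {real_time_factor(whisper_minutes[0]):.3f}"
      f" to {real_time_factor(whisper_minutes[1]):.3f}")
print(f"SenseVoice RTF: {real_time_factor(sensevoice_minutes[0]):.3f}"
      f" to {real_time_factor(sensevoice_minutes[1]):.3f}")
print(f"Implied speedup: {smallest:.1f}x to {largest:.1f}x")
```

With these ranges the implied speedup spans a few-fold on either side of 5x, so the single figure is best read as a midpoint rather than a guarantee.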
Install: pip install funasr