Feature Request: Add SenseVoice/Paraformer as faster ASR option #374

@LauraGPT

Feature Request

whisper-diarization is excellent for ASR + speaker diarization. I'd like to suggest SenseVoice and Paraformer as alternative ASR backends: both are much faster than Whisper, and FunASR ships its own speaker diarization model.

Why SenseVoice/Paraformer?

  • 5-10x faster than Whisper — non-autoregressive, dramatically reduces processing time for long recordings
  • Built-in speaker diarization — FunASR includes cam++ (7.2M params) for speaker embedding, no separate pyannote needed
  • Built-in VAD — FSMN-VAD with accurate timestamps
  • Built-in punctuation — automatic punctuation restoration
  • 50+ languages — SenseVoice handles multilingual content

Complete pipeline comparison

Current: Whisper + pyannote → post-processing alignment
Alternative: FunASR (SenseVoice + FSMN-VAD + cam++ + CT-Punc) → all-in-one

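from funasr import AutoModel

# A single AutoModel call wires the ASR model to the auxiliary models.
model = AutoModel(
    model="iic/SenseVoiceSmall",  # ASR
    vad_model="fsmn-vad",         # voice activity detection
    spk_model="cam++",            # speaker embeddings for diarization
    punc_model="ct-punc",         # punctuation restoration
)
result = model.generate(input="meeting.wav")
# Returns: text + timestamps + speaker labels + punctuation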

Speed benefit for long recordings

For a 1-hour recording:

  • Whisper large-v3: ~10-15 minutes processing
  • SenseVoice: ~2-3 minutes processing (5x faster)
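
The figures above can be restated as real-time factors (processing time divided by audio duration). The sketch below uses only the numbers quoted in this issue; `rtf` and `speedup_range` are illustrative helper names, and no hardware is assumed beyond whatever produced those estimates.

```python
def rtf(processing_min: float, audio_min: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    return processing_min / audio_min


def speedup_range(
    slow: tuple[float, float], fast: tuple[float, float]
) -> tuple[float, float]:
    """Smallest and largest speedup given (low, high) processing times."""
    return slow[0] / fast[1], slow[1] / fast[0]


AUDIO_MIN = 60.0
whisper = (10.0, 15.0)    # minutes, as quoted above
sensevoice = (2.0, 3.0)   # minutes, as quoted above

print([round(rtf(t, AUDIO_MIN), 3) for t in whisper])
print([round(rtf(t, AUDIO_MIN), 3) for t in sensevoice])

low, high = speedup_range(whisper, sensevoice)
print(f"speedup between {low:.1f}x and {high:.1f}x")
```

With these ranges the speedup spans roughly 3.3x to 7.5x, so the 5x figure reads as a midpoint estimate rather than a guarantee.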

Install: pip install funasr
