@LiangLuDev

Error info:
Failed to synthesize speech: Size (273) of dimension (1) is not in allowed range (2..240)


Problem

The TTS models accept only 2 to 240 input tokens. When the input text exceeded the upper limit, synthesis failed with: Size (273) of dimension (1) is not in allowed range (2..240)

Solution

Implemented a multi-level text segmentation strategy that automatically handles long input text:

Three-tier fallback mechanism:

  1. Sentence-level segmentation - Split at sentence boundaries (. ! ? 。 ! ?)
  2. Comma-level segmentation - For long sentences, split at commas (, , 、)
  3. Word-level segmentation - For continuous text without punctuation, split by whitespace

Each level checks the token count and falls through to the next level only when a segment is still too long. Segments are synthesized independently and concatenated with a 300 ms pause between them.
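The fallback order above can be sketched roughly as follows. This is an illustration in Python, not the code in this PR: `count_tokens` is a stand-in that counts non-whitespace characters (the real model tokenizer will differ), and the names `split_to_fit`, `pack`, and `InputTooLongError` are hypothetical. The sketch does not handle the lower bound of 2 tokens.

```python
import re

MAX_TOKENS = 240  # upper bound taken from the error message


class InputTooLongError(ValueError):
    """Raised when a single unsplittable unit exceeds the token limit."""


def count_tokens(text: str) -> int:
    # Assumption: one token per non-whitespace character.
    return sum(1 for ch in text if not ch.isspace())


# Tier 1: after sentence punctuation, tier 2: after commas, tier 3: whitespace.
SPLITTERS = [
    re.compile(r"(?<=[.!?。!?])\s*"),
    re.compile(r"(?<=[,,、])\s*"),
    re.compile(r"\s+"),
]


def pack(pieces, limit):
    """Greedily merge adjacent pieces while the merged text stays within limit."""
    out, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + " " + piece
        if current and count_tokens(candidate) > limit:
            out.append(current)
            current = piece
        else:
            current = candidate
    if current:
        out.append(current)
    return out


def split_to_fit(text, level=0, limit=MAX_TOKENS):
    """Return segments of at most `limit` tokens, trying coarser splits first."""
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= limit:
        return [text]  # short text passes through unchanged
    if level >= len(SPLITTERS):
        raise InputTooLongError(f"unsplittable unit exceeds {limit} tokens")
    pieces = []
    for part in SPLITTERS[level].split(text):
        if part.strip():
            pieces.extend(split_to_fit(part, level + 1, limit))
    return pack(pieces, limit)
```

Merging with a single space suits space-delimited languages; for Chinese or Japanese text a real implementation would probably join pieces without a separator.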

Key Features

  • Automatic handling of texts exceeding the 240-token limit
  • Smart merging to maximize segment length within the constraint
  • Natural sentence pauses (300 ms) between concatenated segments
  • Maintains backward compatibility - short texts are processed unchanged
  • Clear error messages for edge cases (e.g., a single word >240 tokens)

Changes

  • Added inputTooLong error case with descriptive message
  • Added segmentText() for sentence boundary detection
  • Added segmentByCommas() for comma-based splitting
  • Added segmentByWords() as final fallback for continuous text
  • Added concatenateBuffers() for seamless audio merging
  • Refactored generate() to handle multi-segment synthesis
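As a rough illustration of the merging step, the sketch below joins per-segment audio with 300 ms of silence between segments. It is written in Python with plain lists of float samples and is not the PR's `concatenateBuffers()`. The function name `concatenate_with_pauses` and the sample rate in the usage lines are assumptions.

```python
def concatenate_with_pauses(buffers, sample_rate, pause_ms=300):
    """Join sample buffers, inserting `pause_ms` of silence between them."""
    # Number of zero-valued samples that make up one pause.
    silence = [0.0] * (sample_rate * pause_ms // 1000)
    merged = []
    for index, buffer in enumerate(buffers):
        if index > 0:
            merged.extend(silence)  # pause only between segments, not at the ends
        merged.extend(buffer)
    return merged


# Usage with a hypothetical 24 kHz sample rate: each pause is 7200 samples.
first = [0.1] * 100
second = [0.2] * 50
audio = concatenate_with_pauses([first, second], sample_rate=24_000)
```

A single-segment input comes back without any added silence, which matches the note that short texts are processed unchanged.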
