
@overdramatic overdramatic commented Nov 23, 2025

Add DiffSinger BRAPA Phonemizer

⚠️ Please merge this PR after PR #1840

Overview

This PR adds DiffSingerBrapaPhonemizer, an advanced Portuguese phonemizer for the DiffSinger engine that provides sophisticated phonetic processing and accent support beyond the basic Portuguese phonemizer.

Key Features

Advanced Phonetic Processing

Extends DiffSingerRefinedPhonemizer for advanced phonetic rule application

Duration-Based Rules

Implements smart phoneme substitutions for short notes:

  • a → ax for duration ≤ 45ms
  • i → i0 for duration ≤ 45ms
  • u → u0 for duration ≤ 45ms
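As a rough illustration of the rule above (not the phonemizer's actual C# code), here is a minimal Python sketch. It assumes the three rules read as a → ax, i → i0 and u → u0, and that each phoneme comes with its own duration in milliseconds. The names `SHORT_NOTE_MS`, `SHORT_VOWEL_MAP` and `reduce_short_vowels` are made up for this example.

```python
# Hypothetical sketch: names and data layout are assumptions, not the PR's API.
SHORT_NOTE_MS = 45  # threshold taken from the rules above
SHORT_VOWEL_MAP = {"a": "ax", "i": "i0", "u": "u0"}  # assumed reading of the rules


def reduce_short_vowels(phonemes, durations_ms):
    """Return a copy of phonemes where short a/i/u become ax/i0/u0."""
    if len(phonemes) != len(durations_ms):
        raise ValueError("phonemes and durations_ms must have the same length")
    return [
        SHORT_VOWEL_MAP.get(p, p) if d <= SHORT_NOTE_MS else p
        for p, d in zip(phonemes, durations_ms)
    ]


# Only the last "a" is both a mapped vowel and short enough to be replaced.
print(reduce_short_vowels(["k", "a", "z", "a"], [30, 200, 40, 45]))
```

Phonemes outside the map are left alone whatever their duration, so consonants on short notes are never touched.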

Word Boundary Rules

  • Sibilant Evolution: s9 → z before vowels, s9 → z9 before voiced consonants
  • Rhotic Transformation: r9 → r before vowels
  • Nasal-Plosive Interaction: cl → ng when preceded by nasal vowels and followed by voiced plosives

ℹ️ Rule application can be skipped if the lyric has a phonetic hint [ ] or starts with /
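To make the boundary rules more concrete, here is a hedged Python sketch of the sibilant and rhotic cases (s9 → z, s9 → z9, r9 → r) together with the skip condition. It is only an illustration: the voiced-consonant set, the function names and the example phoneme sequences are assumptions, and the cl → ng rule is left out because its exact position in the sequence is not spelled out here.

```python
import re

VOWELS = {"a", "ae", "ah", "an", "ax", "e", "eh", "en", "i", "i0", "in",
          "o", "oh", "on", "u", "u0", "un"}
# Assumption: this PR does not list which consonants count as voiced.
VOICED_CONSONANTS = {"b", "d", "dj", "g", "j", "l", "lh", "m", "n", "nh",
                     "r", "v", "z"}


def has_manual_phonemes(lyric):
    """True if the lyric starts with / or carries a [ ] phonetic hint."""
    if lyric.startswith("/"):
        return True
    match = re.search(r"\[([^\]]*)\]", lyric)
    # [#] and ['] are treated as markup here, not as phonetic hints (assumption).
    return bool(match) and match.group(1).strip() not in {"#", "'"}


def apply_boundary_rules(words):
    """words: list of (lyric, phonemes) pairs in singing order."""
    result = [list(phonemes) for _, phonemes in words]
    for i in range(len(words) - 1):
        lyric = words[i][0]
        if has_manual_phonemes(lyric) or not result[i] or not result[i + 1]:
            continue
        last, following = result[i][-1], result[i + 1][0]
        if last == "s9" and following in VOWELS:
            result[i][-1] = "z"
        elif last == "s9" and following in VOICED_CONSONANTS:
            result[i][-1] = "z9"
        elif last == "r9" and following in VOWELS:
            result[i][-1] = "r"
    return result


# Illustrative phoneme sequences, not taken from the BRAPA dictionary.
print(apply_boundary_rules([("mas", ["m", "a", "s9"]), ("eu", ["e", "w"])]))
```

Only the final phoneme of a word is rewritten, based on the first phoneme of the next word, so the last word in a phrase keeps its placeholder.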

Technical Implementation

  • Dictionary Integration: Uses dsdict-brapa.yaml for Portuguese-specific phonetic rules
  • G2P Integration: Leverages BrapaG2p class for grapheme-to-phoneme conversion
  • Context-Aware Processing: Maintains phonetic context across word boundaries

BRAPA G2P Model

Overview

This PR also introduces the BRAPA (Brazilian Phonetic Alphabet) G2P model, a specialized grapheme-to-phoneme converter optimized for Brazilian Portuguese with enhanced accent handling and extended phonetic features.

Key Features

Enhanced Brazilian Portuguese Support

  • Improved accent implementation specifically tuned for Brazilian Portuguese pronunciation patterns
  • Better handling of regional variations and contemporary speech characteristics

New Dummy Phonemes for Phonetic Precision

Added specialized placeholder phonemes for accurate phonetic representation:

  • s9: Represents S-sound before consonants or at word endings. Can be changed into s or sh
    • Example: "castelo" → /k a s9 t eh l u/
  • z9: Represents Z-sound before voiced consonants. Can be changed into z or j
    • Example: "mesmo" → /m e z9 m u/
  • h9: Represents R-sound before vowels. Can be changed into h, hr or x
    • Example: "caro" → /k a h9 u/
  • r9: Generic placeholder for rhotic sounds. Can be changed into h, hr, r, rw or x
    • Example: "porta" → /p oh r9 t a/

Extended Phonetic Support

Added extra phonemes for better accent handling

  • ah: Represents the default vowel phoneme in stressed syllables where the letter “a” precedes a nasal consonant. Can be orthographically represented as "a" or "â". Examples:
    • cama → /k ah m a/
    • câmera → /k ah m e r ax/
  • ng: Default phoneme for nasal vowel + g interaction. Can be changed into g
    • Example: "manga" → /m an ng a/
  • wn: Default phoneme for nasal vowel + w interaction. Can be changed into w
    • Example: "mão" → /m an wn/
  • yn: Default phoneme for nasal vowel + y interaction. Can be changed into y
    • Example: "mãe" → /m an yn/

Advanced Lyric Processing Features

New markup system for vocal processing in lyrics:

  • [#]: Enables cl (closure/glottal stop)
  • [']: Enables vf (vocal fry)

Usage

For models/voicebanks that support Portuguese with BRAPA, these replacements are recommended:

- {from: ng, to: pt/g} # Nasal + g. Can be changed into [ng] or [g]
- {from: h9, to: pt/h} # R-sound before a Vowel. Can be changed into [h], [hr] or [x]
- {from: r9, to: pt/h} # Rhotic, can be changed into [h], [hr], [r], [rw] or [x]
- {from: s9, to: pt/s} # S-sound before a consonant or in the end of a word. Can be changed into [s] or [sh]
- {from: z9, to: pt/z} # Z-sound before a voiced consonant. Can be changed into [z] or [j]
- {from: ah, to: pt/ax}
- {from: wn, to: pt/w}
- {from: yn, to: pt/y}
- {from: cl, to: cl}
- {from: vf, to: vf}
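For anyone unsure what these entries do, the following Python sketch treats them as a plain one-to-one lookup over a phoneme sequence. This is a simplified model for illustration only: it does not reproduce how the engine actually reads the dictionary, and `apply_replacements` is a made-up helper.

```python
# Simplified model of the replacement list above (one-to-one lookups only).
REPLACEMENTS = [
    {"from": "ng", "to": "pt/g"},
    {"from": "h9", "to": "pt/h"},
    {"from": "r9", "to": "pt/h"},
    {"from": "s9", "to": "pt/s"},
    {"from": "z9", "to": "pt/z"},
    {"from": "ah", "to": "pt/ax"},
    {"from": "wn", "to": "pt/w"},
    {"from": "yn", "to": "pt/y"},
    {"from": "cl", "to": "cl"},
    {"from": "vf", "to": "vf"},
]


def apply_replacements(phonemes, replacements):
    """Map each phoneme through the replacement table; others pass through."""
    table = {entry["from"]: entry["to"] for entry in replacements}
    return [table.get(p, p) for p in phonemes]


# "manga" -> /m an ng a/ from the examples above
print(apply_replacements(["m", "an", "ng", "a"], REPLACEMENTS))

# Choosing a different accent means changing a target, e.g. s9 -> pt/sh.
sh_variant = [dict(e, to="pt/sh") if e["from"] == "s9" else e for e in REPLACEMENTS]
print(apply_replacements(["k", "a", "s9", "t", "eh", "l", "u"], sh_variant))
```

Phonemes without an entry are passed through unchanged in this sketch; a real voicebank dictionary would also need mappings for the regular phonemes.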

Technical Features

Dictionary Entries Total: 75170

  • Words: 26766
  • Syllabic Support: 15470
  • [cl] support: 16467
  • [vf] support: 16467

Vowels:

"a", "ae", "ah", "an", "ax", "e", "eh", "en", "i", "i0", "in", "o", "oh", "on", "u", "u0", "un"

Consonants:

"b", "ch", "d", "dj", "f", "g", "h", "h9", "hr", "j", "k", "l", "lh", "m", "n",
"ng", "nh", "p", "r", "r9", "rh", "rr", "rw", "s", "s9", "sh", "t", "v", "w", "wn",
"x", "y", "yn", "z", "z9", "cl", "vf"
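A quick way to sanity-check a hand-written transcription against this inventory is a small lookup like the Python sketch below. The helper name `unknown_phonemes` and the slash-delimited input format are assumptions made for the example; the symbol lists are copied from above.

```python
VOWELS = ["a", "ae", "ah", "an", "ax", "e", "eh", "en", "i", "i0", "in",
          "o", "oh", "on", "u", "u0", "un"]
CONSONANTS = ["b", "ch", "d", "dj", "f", "g", "h", "h9", "hr", "j", "k", "l",
              "lh", "m", "n", "ng", "nh", "p", "r", "r9", "rh", "rr", "rw",
              "s", "s9", "sh", "t", "v", "w", "wn", "x", "y", "yn", "z", "z9",
              "cl", "vf"]
INVENTORY = frozenset(VOWELS) | frozenset(CONSONANTS)


def unknown_phonemes(transcription):
    """Return the symbols in a '/k a s9 .../' string that are not in the inventory."""
    symbols = transcription.strip().strip("/").split()
    return [s for s in symbols if s not in INVENTORY]


# An empty list means every symbol is part of the BRAPA inventory.
print(unknown_phonemes("/k a s9 t eh l u/"))
print(unknown_phonemes("/k a q u/"))
```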

Drawbacks

RNN-T models do not handle complex languages well, so OOV (out-of-vocabulary) entries can give wrong results. If that happens, try writing the word as it is spoken instead of as it is written.

- Add comprehensive G2P dictionary support with replacement types (single, merge, split, many-to-many)
- Implement sophisticated phoneme validation and language prefix handling
- Add timing-based phoneme processing with duration-dependent modifications
- Include word-level phoneme editing capabilities for cross-phoneme modifications
- Integrate ONNX machine learning models (linguistic and duration prediction)
- Support multi-language models with language ID mapping
- Add comprehensive speaker embedding management
- Include tensor caching for performance optimization
- Implement robust error handling and validation throughout
- Add extensive XML documentation for all public APIs

This phonemizer provides a robust foundation for DiffSinger voice synthesis with support for complex phoneme transformations and cross-note processing.

- Add BRAPA (Brazilian Portuguese Phonetic Alphabet) G2P
- Update brapa-g2p.zip
- Added support on phonetic assistant

Introduces DiffSingerBrapaPhonemizer, a phonemizer for Brazilian Portuguese (BRAPA) in the DiffSinger engine. Implements phonetic rules, word boundary processing, and duration-based phoneme replacements tailored for the language.

Refactored EditTimedPhonemes to accept next note and its first phoneme duration.
Updated ProcessPart to collect phoneme timing data in two passes, allowing timing edits with access to neighboring note information

Updated the class name to match the phonemizer name