Problem
Present tense sentences are pronounced incorrectly:
- "I read books every day" → "red" instead of "reed"
- "I reread this chapter weekly" → "re-red" instead of "re-reed"
Root Cause
The dictionary mixes two tagging conventions inconsistently:
Misaki's own tags (work correctly):
| Tag |
Entries |
Usage |
| DEFAULT |
789 |
Fallback pronunciation |
| VERB |
272 |
Generic verb form |
| NOUN |
425 |
Noun form |
| ADJ |
63 |
Adjective form |
| None |
32 |
Stressed function words |
Penn Treebank tags (used inconsistently):
| Tag |
Entries |
Issue |
| VBD |
4 |
Correct (past tense) |
| VBN |
3 |
Correct (past participle) |
| VBP |
3 |
All 3 are wrong |
Since Misaki uses spaCy (en_core_web_sm), the tagger outputs Penn Treebank tags. When spaCy returns VBP (present tense, non-3rd person singular), the dictionary maps it to past-tense pronunciations:
"read": {"VBP": "ɹˈɛd", ...} // VBP="I read daily" should be "reed"
"reread": {"VBP": "ɹiɹˈɛd", ...} // VBP="I reread often" should be "re-reed"
"wound": {"VBP": "wˈWnd", ...} // VBP="I wound it" should be "woond" (injury)
It appears VBD/VBN/VBP were added specifically for these homographs, but VBP was mapped using past-tense semantics rather than Penn Treebank's definition (present tense).
Suggested Fix
Option A: Fix the 3 entries (minimal change)
"read": {"DEFAULT": "ɹˈid", "VBP": "ɹˈid", "VBD": "ɹˈɛd", "VBN": "ɹˈɛd", "ADJ": "ɹˈɛd"}
"reread": {"DEFAULT": "ɹˌiɹˈid", "VBP": "ɹˌiɹˈid", "VBD": "ɹiɹˈɛd", "VBN": "ɹiɹˈɛd", "NOUN": "ɹˈiɹid"}
"wound": {"DEFAULT": "wˈund", "VBP": "wˈund", "VBD": "wˈWnd", "VBN": "wˈWnd"}
Option B: Standardize tagging convention (architectural)
Consider whether to:
- Use Penn Treebank tags consistently (VB, VBD, VBG, VBN, VBP, VBZ)
- Use Misaki's own tags consistently (VERB, NOUN, ADJ, etc.)
- Document the hybrid approach and ensure mappings are correct
Affected Entries
These are the only 3 entries with VBP tags, and all three are incorrect:
Happy to submit a PR for Option A, or discuss Option B further.
Problem
Present tense sentences are pronounced incorrectly:
Root Cause
The dictionary mixes two tagging conventions inconsistently:
Misaki's own tags (work correctly):
Penn Treebank tags (used inconsistently):
Since Misaki uses spaCy (
en_core_web_sm), the tagger outputs Penn Treebank tags. When spaCy returnsVBP(present tense, non-3rd person singular), the dictionary maps it to past-tense pronunciations:It appears VBD/VBN/VBP were added specifically for these homographs, but VBP was mapped using past-tense semantics rather than Penn Treebank's definition (present tense).
Suggested Fix
Option A: Fix the 3 entries (minimal change)
Option B: Standardize tagging convention (architectural)
Consider whether to:
Affected Entries
These are the only 3 entries with VBP tags, and all three are incorrect:
Happy to submit a PR for Option A, or discuss Option B further.