Skip to content

Inconsistent POS tag conventions cause incorrect pronunciations for "read", "reread", "wound" #94

@heyaphra

Description

@heyaphra

Problem

Present tense sentences are pronounced incorrectly:

  • "I read books every day" → "red" instead of "reed"
  • "I reread this chapter weekly" → "re-red" instead of "re-reed"

Root Cause

The dictionary mixes two tagging conventions inconsistently:

Misaki's own tags (work correctly):

Tag Entries Usage
DEFAULT 789 Fallback pronunciation
VERB 272 Generic verb form
NOUN 425 Noun form
ADJ 63 Adjective form
None 32 Stressed function words

Penn Treebank tags (used inconsistently):

Tag Entries Issue
VBD 4 Correct (past tense)
VBN 3 Correct (past participle)
VBP 3 All 3 are wrong

Since Misaki uses spaCy (en_core_web_sm), the tagger outputs Penn Treebank tags. When spaCy returns VBP (present tense, non-3rd person singular), the dictionary maps it to past-tense pronunciations:

"read":   {"VBP": "ɹˈɛd", ...}   // VBP="I read daily" should be "reed"
"reread": {"VBP": "ɹiɹˈɛd", ...} // VBP="I reread often" should be "re-reed"
"wound":  {"VBP": "wˈWnd", ...}  // VBP="I wound it" should be "woond" (injury)

It appears VBD/VBN/VBP were added specifically for these homographs, but VBP was mapped using past-tense semantics rather than Penn Treebank's definition (present tense).

Suggested Fix

Option A: Fix the 3 entries (minimal change)

"read":   {"DEFAULT": "ɹˈid", "VBP": "ɹˈid", "VBD": "ɹˈɛd", "VBN": "ɹˈɛd", "ADJ": "ɹˈɛd"}
"reread": {"DEFAULT": "ɹˌiɹˈid", "VBP": "ɹˌiɹˈid", "VBD": "ɹiɹˈɛd", "VBN": "ɹiɹˈɛd", "NOUN": "ɹˈiɹid"}
"wound":  {"DEFAULT": "wˈund", "VBP": "wˈund", "VBD": "wˈWnd", "VBN": "wˈWnd"}

Option B: Standardize tagging convention (architectural)

Consider whether to:

  • Use Penn Treebank tags consistently (VB, VBD, VBG, VBN, VBP, VBZ)
  • Use Misaki's own tags consistently (VERB, NOUN, ADJ, etc.)
  • Document the hybrid approach and ensure mappings are correct

Affected Entries

These are the only 3 entries with VBP tags, and all three are incorrect:

read, reread, wound

Happy to submit a PR for Option A, or discuss Option B further.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions