
Statistical tokenization algorithms #207

@tejasvaidhyadev

Hi @aviks and @oxinabox
Statistical tokenizers are used in a lot of Transformer-based models, including the BERT family, because of their ability to tackle the out-of-vocabulary problem.
I have gone through the tokenizers in WordTokenizers.jl, which I think are pretty good and fast, and it would be great if we could build statistical tokenizers like BPE, unigram language models, etc. on top of them.
I have gone through the following papers:
BPE
unigram language model
Any suggestions on how to proceed? A rough Julia sketch of the BPE merge step is below.
Where should we keep it: TextAnalysis.jl or WordTokenizers.jl?
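To make the discussion concrete, here is a minimal, self-contained sketch of what BPE merge learning (Sennrich et al., 2016) could look like in plain Julia. It does not use any WordTokenizers.jl or TextAnalysis.jl API; all function names (`split_word`, `pair_counts`, `merge_pair`, `learn_bpe`) are hypothetical placeholders for illustration, not a proposed interface.

```julia
# Minimal sketch of BPE merge learning; all names are illustrative only.

# Split a word into single-character symbols plus an end-of-word marker.
function split_word(word::AbstractString)
    symbols = [string(c) for c in word]
    push!(symbols, "</w>")                        # marker used in the BPE paper
    return symbols
end

# Count frequencies of adjacent symbol pairs across the whole vocabulary.
function pair_counts(vocab::Dict{Vector{String},Int})
    counts = Dict{Tuple{String,String},Int}()
    for (symbols, freq) in vocab
        for i in 1:length(symbols)-1
            pair = (symbols[i], symbols[i+1])
            counts[pair] = get(counts, pair, 0) + freq
        end
    end
    return counts
end

# Replace every occurrence of `pair` in a symbol sequence with the merged symbol.
function merge_pair(symbols::Vector{String}, pair::Tuple{String,String})
    merged = String[]
    i = 1
    while i <= length(symbols)
        if i < length(symbols) && (symbols[i], symbols[i+1]) == pair
            push!(merged, symbols[i] * symbols[i+1])
            i += 2
        else
            push!(merged, symbols[i])
            i += 1
        end
    end
    return merged
end

# Learn `num_merges` merge rules from a word-frequency table.
function learn_bpe(word_freqs::Dict{String,Int}, num_merges::Int)
    vocab = Dict(split_word(w) => f for (w, f) in word_freqs)
    merges = Tuple{String,String}[]
    for _ in 1:num_merges
        counts = pair_counts(vocab)
        isempty(counts) && break
        # Pick the most frequent adjacent pair.
        best, best_count = first(counts)
        for (pair, c) in counts
            if c > best_count
                best, best_count = pair, c
            end
        end
        push!(merges, best)
        vocab = Dict(merge_pair(s, best) => f for (s, f) in vocab)
    end
    return merges
end

# Toy example in the style of the BPE paper's illustration.
merges = learn_bpe(Dict("low" => 5, "lower" => 2, "newest" => 6, "widest" => 3), 10)
```

Encoding new text would then just apply the learned merges in order to each word, which is where the existing fast tokenizers in WordTokenizers.jl could serve as the pre-tokenization step.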
