
Statistical tokenization algorithms #207

@tejasvaidhyadev

Hi @aviks and @oxinabox
Statistical tokenizers are used in a lot of Transformer-based models, including the BERT family, because of their ability to tackle the out-of-vocabulary problem.
I have gone through the tokenizers in WordTokenizers.jl, which I think are pretty good and fast, and it would be great if we could build statistical tokenizers like BPE, unigram language models, etc. on top of them.
I have gone through the following papers:
BPE
unigram language model
Any suggestions on how to proceed? A rough Julia sketch of the BPE merge step is below.
Where should we keep it: TextAnalysis.jl or WordTokenizers.jl?
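To make the discussion concrete, here is a minimal, self-contained sketch of what BPE merge learning (Sennrich et al., 2016) could look like in plain Julia. It does not use any WordTokenizers.jl or TextAnalysis.jl API; all function names (`split_word`, `pair_counts`, `merge_pair`, `learn_bpe`) are hypothetical placeholders for illustration, not a proposed interface.

```julia
# Minimal sketch of BPE merge learning; all names are illustrative only.

# Split a word into single-character symbols plus an end-of-word marker.
function split_word(word::AbstractString)
    symbols = [string(c) for c in word]
    push!(symbols, "</w>")                        # marker used in the BPE paper
    return symbols
end

# Count frequencies of adjacent symbol pairs across the whole vocabulary.
function pair_counts(vocab::Dict{Vector{String},Int})
    counts = Dict{Tuple{String,String},Int}()
    for (symbols, freq) in vocab
        for i in 1:length(symbols)-1
            pair = (symbols[i], symbols[i+1])
            counts[pair] = get(counts, pair, 0) + freq
        end
    end
    return counts
end

# Replace every occurrence of `pair` in a symbol sequence with the merged symbol.
function merge_pair(symbols::Vector{String}, pair::Tuple{String,String})
    merged = String[]
    i = 1
    while i <= length(symbols)
        if i < length(symbols) && (symbols[i], symbols[i+1]) == pair
            push!(merged, symbols[i] * symbols[i+1])
            i += 2
        else
            push!(merged, symbols[i])
            i += 1
        end
    end
    return merged
end

# Learn `num_merges` merge rules from a word-frequency table.
function learn_bpe(word_freqs::Dict{String,Int}, num_merges::Int)
    vocab = Dict(split_word(w) => f for (w, f) in word_freqs)
    merges = Tuple{String,String}[]
    for _ in 1:num_merges
        counts = pair_counts(vocab)
        isempty(counts) && break
        # Pick the most frequent adjacent pair.
        best, best_count = first(counts)
        for (pair, c) in counts
            if c > best_count
                best, best_count = pair, c
            end
        end
        push!(merges, best)
        vocab = Dict(merge_pair(s, best) => f for (s, f) in vocab)
    end
    return merges
end

# Toy example in the style of the BPE paper's illustration.
merges = learn_bpe(Dict("low" => 5, "lower" => 2, "newest" => 6, "widest" => 3), 10)
```

Encoding new text would then just apply the learned merges in order to each word, which is where the existing fast tokenizers in WordTokenizers.jl could serve as the pre-tokenization step.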
