Closed
Description
Hi @aviks and @oxinabox
Statistical tokenizers are used in a lot of Transformer-based models, including the BERT family, because of their ability to tackle the out-of-vocabulary problem.
I went through the tokenizers in WordTokenizers.jl, which I think are pretty good and fast. It would be great if we could build statistical tokenizers such as BPE and unigram language models on top of it.
I have gone through the following papers:
BPE
unigram language model
Any suggestions on how to proceed?
Where should we keep it: in TextAnalysis.jl or WordTokenizers.jl?
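To make the proposal a bit more concrete, here is a rough sketch of how BPE merge learning and segmentation could work. It is written in Python purely for illustration and is not WordTokenizers.jl code. The function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, and the toy word counts are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} dict."""
    # Start from characters, plus a hypothetical end-of-word marker.
    vocab = {tuple(word) + ("</w>",): n for word, n in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, n in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += n
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): n for symbols, n in vocab.items()}
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus statistics, made up for this example.
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_counts, num_merges=5)
print(segment("lowest", merges))
```

With these toy counts, the unseen word "lowest" is split into the known subwords `low` and `est</w>` rather than being mapped to an unknown token, which is the out-of-vocabulary behaviour mentioned above. A real implementation would also need pre-tokenization (where the existing tokenizers could plug in) and a way to save and load the merge table.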