Scripts for converting the BabyLM dataset and BabyLM evaluation data to an IPA representation.
To run the scripts, first create a virtual environment for the project by running setup.sh:
./setup.sh
source setup.sh
You will then need to download the BabyLM 2024 dataset and evaluation data (both available here) and place them in the repository.
These scripts were used to prepare the training data, evaluation data and tokenizers used in From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes.
- See prepare_babylm.ipynb for the code used to create the IPA-BabyLM dataset.
- See prepare_evaluation_data.ipynb for the code used to create the IPA-BabyLM evaluation data.
- See train_tokenizer.ipynb for the code used to create the eight tokenizers.
The training scripts are available here.