BabyLM to IPA

Scripts for converting the BabyLM dataset and BabyLM evaluation data to an IPA representation.
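
As a rough illustration of what an IPA representation of text looks like, here is a minimal sketch that maps words to space-separated phonemes using a tiny hand-written lexicon. Everything in it is an assumption made for illustration: `TOY_LEXICON`, `WORD_BOUNDARY`, `UNKNOWN` and `to_ipa` are hypothetical names, and this is not the conversion pipeline used in the notebooks.

```python
import string

# Hypothetical toy lexicon for illustration only; the real conversion
# in this repository is done in the notebooks, not with this table.
TOY_LEXICON = {
    "the": "ð ə",
    "cat": "k æ t",
    "sat": "s æ t",
}

# Hypothetical marker names, chosen for this sketch.
WORD_BOUNDARY = "WORD_BOUNDARY"
UNKNOWN = "UNK"


def to_ipa(sentence, lexicon=TOY_LEXICON):
    """Convert a sentence to a phoneme string using a lookup table."""
    transcribed = []
    for token in sentence.lower().split():
        # Drop leading/trailing punctuation so "sat." matches "sat".
        word = token.strip(string.punctuation)
        if not word:
            continue
        # Words missing from the lexicon are marked rather than guessed.
        transcribed.append(lexicon.get(word, UNKNOWN))
    # Join words with an explicit boundary marker between them.
    return f" {WORD_BOUNDARY} ".join(transcribed)


print(to_ipa("The cat sat."))
```

A lookup table only covers the words it lists, so a real conversion needs a proper grapheme-to-phoneme tool; the sketch is meant only to show the shape of the output.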

Installation

To run the scripts, first create a virtual environment for the project by running setup.sh.

./setup.sh
source setup.sh

You will then need to download the BabyLM 2024 dataset and evaluation data (both available here) and place them in the repository.

These scripts were used to prepare the training data, evaluation data and tokenizers used in From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes.

See prepare_babylm.ipynb for the code used to create the IPA-BabyLM dataset.
See prepare_evaluation_data.ipynb for the code used to create the IPA-BabyLM evaluation data.
See train_tokenizer.ipynb for the code used to create the eight tokenizers.

The training scripts are available here.
