codebyzeb/babylm-ipa

BabyLM to IPA

Scripts for converting the BabyLM dataset and BabyLM evaluation data to an IPA representation.

Installation

To run the scripts, first create a virtual environment for the project by running setup.sh.

./setup.sh
source setup.sh

You will then need to download the BabyLM 2024 dataset and evaluation data (both available here) and place them in the repository.

Usage

These scripts were used to prepare the training data, evaluation data and tokenizers used in From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes.

The training scripts are available here.
