📜 Multilingual Alignment and Collation of the De Regimine Principum in Latin, Vernacular, and English Traditions 🌍
A corpus-based project for sentence segmentation, multilingual alignment, and philological analysis of medieval translations.
This repository brings together historical linguistics, digital humanities, and natural language processing (NLP) to address a key challenge in premodern textual studies:
🧠 How can we systematically align, compare, and computationally analyze medieval prose across diverse linguistic, scribal, and editorial traditions?
We provide open-access datasets and tools for:
- ✂️ Sentence segmentation
- 🌐 Multilingual alignment
- 🧩 Textual collation and variant tracking
De Regimine Principum: Transmission and Translation
Composed in Latin by Giles of Rome (Aegidius Romanus) in the late 13th century, De Regimine Principum circulated rapidly throughout Europe and was translated into a range of vernaculars.
This corpus offers a unique testbed for multilingual comparison, combining literary, political, and philosophical content.
This work on De Regimine Principum represents the second phase of a broader research initiative exploring computational methods for multilingual medieval corpora.
The first phase, detailed in the Aquilign repository, focused on building a modular pipeline for segmentation, alignment, and collation.
It was first applied to the Lancelot en prose corpus — a multilingual narrative tradition — to evaluate alignment accuracy across closely related Romance languages.
➡️ The resulting corpus and annotated alignments are published in the companion repository:
lancelot-par-maints-langages
The present project builds on this foundation and expands the methodology to De Regimine Principum and its vernacular and English translations, a corpus that introduces greater generic, structural, and linguistic diversity.
- Create historically informed, linguistically diverse training data for historical NLP
- Model sentence segmentation in Latin and medieval Romance languages
- Perform multilingual alignment of parallel textual traditions
- Enable collation and variant tracking across translated and original versions
We combine philological insight with machine learning techniques:
- 🧭 Collation workflow: a reference collation table is built from Latin witnesses
- 🌍 Alignment: vernacular versions are segmented and aligned using embedding-based tools
- 🧮 Variant detection: semantic and structural divergences are analyzed using both manual and computational methods
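The alignment step can be illustrated with a minimal sketch. It is not the Aquilign implementation: the `embed` function is a toy character-trigram stand-in for a multilingual sentence encoder, and `align`, `threshold`, and the sample sentences are hypothetical names and data chosen for illustration. The sketch shows the general idea of monotonic alignment: score every source/target pair by cosine similarity, then use dynamic programming to select the order-preserving set of pairs with the highest total score.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for a sentence encoder: counts of character n-grams."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(source, target, threshold=0.3):
    """Return order-preserving (source_index, target_index) pairs.

    A pair contributes (similarity - threshold) to the total score, so
    pairs below the threshold are never worth matching; skipping a
    segment on either side costs nothing.
    """
    src = [embed(s) for s in source]
    tgt = [embed(t) for t in target]
    n, m = len(src), len(tgt)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            gain = cosine(src[i - 1], tgt[j - 1]) - threshold
            score[i][j], back[i][j] = max(
                (score[i - 1][j - 1] + gain, "match"),
                (score[i - 1][j], "skip_src"),
                (score[i][j - 1], "skip_tgt"),
            )
    # Walk back from the bottom-right cell to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


# Invented sample data: the second witness omits the middle segment.
witness_a = ["rex debet esse iustus", "regimen civitatis", "de prudentia regis"]
witness_b = ["rex debet esse iustus", "de prudentia regis"]
print(align(witness_a, witness_b))  # [(0, 0), (2, 1)]
```

Because the toy embedding only captures surface overlap, the example compares two witnesses in the same language. With a multilingual encoder in place of `embed`, the same dynamic programme would apply to a Latin text and its vernacular translation.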
Preliminary alignments and exploratory visualizations are available via the dedicated project page: ➡️ Multilingual Aegidius Project Page
This section outlines how textual data was prepared and structured in the Multilingual Aegidius project.
The construction of the multilingual corpus involved several stages, combining curated datasets and primary source texts. Each step in the pipeline is modular and reproducible, and is designed for extensibility, so that it can be applied to other authors, languages, or textual genres.
To support the development of robust segmentation and alignment models, the project relies on two complementary resources hosted in separate repositories:
- multilingual-segmentation-dataset
  Annotated datasets for sentence and phrase segmentation in historical texts. Includes manually segmented examples across seven medieval languages (13th–16th c.), used to train and evaluate BERT-based segmenters for historical prose.
- parallelium-scriptures-alignment-dataset
  A multilingual dataset of aligned Biblical and Qur’anic texts, combining medieval and modern versions in 9 languages. Designed for training alignment models, especially in historical and philological contexts.
These resources form the foundation for segmentation and alignment tasks within the Aquilign pipeline, used to process the Aegidius corpus.
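Segmentation data of this kind is commonly presented to BERT-style models as a token classification task, in which each token is labelled according to whether it opens a new segment. The sketch below assumes a hypothetical plain-text format where segment boundaries are marked with a `|` delimiter; the actual annotation format of the segmentation dataset may differ, and the function names and the sample sentence are illustrative only.

```python
def to_token_labels(segmented_text, delimiter="|"):
    """Turn delimiter-marked text into parallel token and label lists.

    Label 1 marks the first token of a segment, label 0 any other token.
    """
    tokens, labels = [], []
    for segment in segmented_text.split(delimiter):
        for position, word in enumerate(segment.split()):
            tokens.append(word)
            labels.append(1 if position == 0 else 0)
    return tokens, labels


def from_token_labels(tokens, labels):
    """Rebuild the list of segments from tokens and boundary labels."""
    segments = []
    for token, label in zip(tokens, labels):
        if label == 1 or not segments:
            segments.append([token])
        else:
            segments[-1].append(token)
    return [" ".join(segment) for segment in segments]


# Invented example sentence with one boundary marker.
tokens, labels = to_token_labels("Rex debet esse iustus | et populum suum regere")
print(labels)                             # [1, 0, 0, 0, 1, 0, 0, 0]
print(from_token_labels(tokens, labels))  # ['Rex debet esse iustus', 'et populum suum regere']
```

Keeping the conversion reversible makes it straightforward to compare predicted boundaries with the manual segmentation at evaluation time.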
The data/aegidius directory contains the core multilingual text corpus for this project. It features versions of De Regimine Principum in several medieval languages, supporting research in:
- Historical linguistics
- Machine translation
- Philological analysis

The directory provides:
- Parallel texts in Latin, French, English, and more
- Sentence-level alignments for comparative study
- Metadata for sources and editions
This corpus forms the foundation of the preliminary results presented in the Results section.
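Once two witnesses are aligned at sentence level, a first pass at variant tracking can be as simple as a word-level diff of each aligned pair. The sketch below uses the standard library's `difflib.SequenceMatcher` for this; it illustrates the idea and is not the project's collation method. The function name `word_variants` and the two sample readings are invented for the example.

```python
import difflib


def word_variants(base, witness):
    """List word-level differences between two aligned segments.

    Each variant is a tuple (operation, base_reading, witness_reading),
    where operation is 'replace', 'delete', or 'insert'.
    """
    base_words, witness_words = base.split(), witness.split()
    matcher = difflib.SequenceMatcher(a=base_words, b=witness_words, autojunk=False)
    variants = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            variants.append(
                (tag, " ".join(base_words[i1:i2]), " ".join(witness_words[j1:j2]))
            )
    return variants


# Invented readings of the same aligned segment in two witnesses.
print(word_variants("princeps debet esse iustus", "rex debet esse iustus et prudens"))
# [('replace', 'princeps', 'rex'), ('insert', '', 'et prudens')]
```

A surface diff like this only works within one language. Divergences between a Latin source and its translations need the semantic comparison described in the methodology.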
Pretrained models and evaluation outputs will be published here as they become available.
📌 Stay tuned for Hugging Face links and downloadable checkpoints in future releases.
Contributions to the project are highly encouraged, whether they be additional data, bug fixes, or enhancements to the analysis scripts. To contribute:
- Fork the Repository – Start by forking the repository and cloning it locally.
- Create a Branch – Make your changes in a new branch named after the feature or fix.
- Submit a Pull Request – After pushing your changes to your fork, open a pull request for discussion and review.
This repository is part of a broader ecosystem of tools and corpora developed for the study of medieval multilingual textual traditions:
- Aquilign
  A clause-level multilingual alignment engine based on contextual embeddings (LaBSE), designed specifically for premodern texts.
- Multilingual Segmentation Dataset
  Source texts and segmented versions in multiple medieval Romance languages, as well as Latin and English, used for training and evaluating clause segmentation models.
- Parallelium – an aligned scriptures dataset
  A multilingual dataset of aligned Biblical and Qur’anic texts, spanning medieval and modern languages, designed for training and evaluating multilingual alignment models, especially in historical and philological contexts.
- Lancelot par maints langages
  A parallel corpus of translations of the Lancelot en prose in medieval French, Castilian, and Italian, segmented and aligned using the Aquilign pipeline.
Premiers jalons de la collation multilingue du De regimine principum latin et vernaculaire (First milestones in the multilingual collation of the Latin and vernacular De regimine principum)
👥 Matthias Gille Levenson, Lucence Ing, Carolina Macedo
📽️ View presentation slides (PDF)
This work benefited from national funding managed by the Agence Nationale de la Recherche under the Investissements d'avenir programme with the reference ANR-21-ESRE-0005 (Biblissima+).
This project is licensed under the CC BY-NC-SA 4.0 license.
This license allows users to adapt, remix, and build upon the work non-commercially, as long as they credit the authors and license their new creations under the same terms.