📜 Multilingual Alignment and Collation of the De Regimine Principum in Latin, Vernacular, and English Traditions 🌍

A corpus-based project for sentence segmentation, multilingual alignment, and philological analysis of medieval translations.

This repository brings together historical linguistics, digital humanities, and natural language processing (NLP) to address a key challenge in premodern textual studies:

🧠 How can we systematically align, compare, and computationally analyze medieval prose across diverse linguistic, scribal, and editorial traditions?

We provide open-access datasets and tools for:

✂️ Sentence segmentation
🌐 Multilingual alignment
🧩 Textual collation and variant tracking

🏰 A Mirror for Princes Across Borders

De Regimine Principum: Transmission and Translation

Composed in Latin by Giles of Rome (Aegidius Romanus) in the late 13th century, De Regimine Principum circulated rapidly throughout Europe and was translated into a range of vernaculars.
This corpus offers a unique testbed for multilingual comparison, combining literary, political, and philosophical content.

📚 Research Background

This work on De Regimine Principum represents the second phase of a broader research initiative exploring computational methods for multilingual medieval corpora.

The first phase, detailed in the Aquilign repository, focused on building a modular pipeline for segmentation, alignment, and collation.
It was first applied to the Lancelot en prose corpus — a multilingual narrative tradition — to evaluate alignment accuracy across closely related Romance languages.

➡️ The resulting corpus and annotated alignments are published in the companion repository:
lancelot-par-maints-langages

The present project builds on this foundation and expands the methodology to De Regimine Principum and its vernacular and English translations, a corpus that introduces greater generic, structural, and linguistic diversity.

🎯 Project Goals

Create historically informed, linguistically diverse training data for historical NLP
Model sentence segmentation in Latin and medieval Romance languages
Perform multilingual alignment of parallel textual traditions
Enable collation and variant tracking across translated and original versions

🧪 Methodology

We combine philological insight with machine learning techniques:

🧭 Collation workflow: a reference collation table is built from Latin witnesses
🌍 Alignment: vernacular versions are segmented and aligned using embedding-based tools
🧮 Variant detection: semantic and structural divergences are analyzed using both manual and computational methods
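
The alignment step can be illustrated with a minimal sketch. It assumes each segment has already been turned into a vector by a multilingual sentence encoder (the Related Projects section names LaBSE for Aquilign) and aligns two sequences monotonically by dynamic programming. The function names, the `skip_penalty` parameter, and the toy vectors are assumptions made for illustration; this is not the Aquilign implementation.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(src_vecs, tgt_vecs, skip_penalty=0.2):
    """Monotonic alignment of two segment sequences.

    Returns (src_index, tgt_index) pairs; segments without a
    counterpart in the other version are left out.
    """
    n, m = len(src_vecs), len(tgt_vecs)
    # score[i][j]: best score for the first i source and j target segments
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:
                sim = cosine(src_vecs[i - 1], tgt_vecs[j - 1])
                options.append((score[i - 1][j - 1] + sim, "match"))
            if i > 0:
                options.append((score[i - 1][j] - skip_penalty, "skip_src"))
            if j > 0:
                options.append((score[i][j - 1] - skip_penalty, "skip_tgt"))
            score[i][j], back[i][j] = max(options)
    # Walk back from the end to recover the matched pairs
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Toy vectors standing in for real sentence embeddings (hypothetical data):
# the second "Latin" segment has no counterpart in the "French" version.
latin = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
french = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]
print(align(latin, french))  # [(0, 0), (2, 1)]
```

The skip penalty lets the aligner leave a segment unmatched when a translation omits or adds material, which is common in medieval vernacular versions; its value here is arbitrary and would need tuning on real data.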

📊 Preliminary Results

Preliminary alignments and exploratory visualizations are available via the dedicated project page: ➡️ Multilingual Aegidius Project Page

💾 Data Overview

This section outlines how textual data was prepared and structured in the Multilingual Aegidius project.

🧩 Corpus Overview

The construction of the multilingual corpus involved several stages, combining both curated datasets and primary source texts. Each step in the pipeline is modular, reproducible, and designed for extensibility, so that it can be applied to other authors, languages, or textual genres.

🧠 Training Datasets for Segmentation and Alignment

To support the development of robust segmentation and alignment models, the project relies on two complementary resources hosted in separate repositories:

multilingual-segmentation-dataset
Annotated datasets for sentence and phrase segmentation in historical texts. Includes manually segmented examples across seven medieval languages (13th–16th c.), used to train and evaluate BERT-based segmenters for historical prose.
parallelium-scriptures-alignment-dataset
A multilingual dataset of aligned Biblical and Qur’anic texts, combining medieval and modern versions in nine languages. Designed for training alignment models, especially in historical and philological contexts.

These resources form the foundation for the segmentation and alignment tasks of the Aquilign pipeline, which is used to process the Aegidius corpus.
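
As a rough illustration of how manually segmented text can become training data for a token-level segmenter, the sketch below converts a delimiter-annotated line into (token, label) pairs and back. The `£` delimiter, the binary label scheme, the function names, and the sample line are all assumptions made for this example; the actual annotation format of the datasets may differ.

```python
def to_token_labels(annotated, delimiter="£"):
    """Turn a delimiter-annotated line into (token, label) pairs.

    Label 1 marks a token that opens a new segment (it follows a
    delimiter); every other token gets label 0.
    """
    pairs = []
    starts_segment = False
    for token in annotated.split():
        if token == delimiter:
            starts_segment = True
            continue
        pairs.append((token, 1 if starts_segment else 0))
        starts_segment = False
    return pairs


def to_segments(pairs):
    """Rebuild the list of segments from (token, label) pairs."""
    segments, current = [], []
    for token, label in pairs:
        if label == 1 and current:
            segments.append(" ".join(current))
            current = []
        current.append(token)
    if current:
        segments.append(" ".join(current))
    return segments


# Invented illustrative line, not taken from the corpus
line = "rex debet esse iustus £ et populus debet esse fidelis"
labels = to_token_labels(line)
print(labels[4])            # ('et', 1)
print(to_segments(labels))  # ['rex debet esse iustus', 'et populus debet esse fidelis']
```

Framing segmentation as per-token labelling is one common way to fine-tune a BERT-style model for this task; a trained model would predict the labels, and `to_segments` would then recover the segments.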

📂 Core Aegidius Corpus

The data/aegidius directory contains the core multilingual text corpus for this project. It features versions of De Regimine Principum in several medieval languages, supporting research in:

Historical linguistics
Machine translation
Philological analysis

📄 Contents:

Parallel texts in Latin, French, English, and more
Sentence-level alignments for comparative study
Metadata for sources and editions
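
To make the idea of sentence-level alignments for comparative study concrete, here is a small sketch of one aligned unit holding the same passage in several witnesses, with a word-level comparison built on Python's standard `difflib`. The `AlignmentUnit` class, its fields, the witness labels, and the sample readings are hypothetical; they do not describe the actual file layout of `data/aegidius`.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class AlignmentUnit:
    """One aligned passage: a shared id plus one reading per witness."""
    unit_id: str
    texts: dict = field(default_factory=dict)  # witness label -> segment text


def variants(unit, base, other):
    """List word-level differences between two witnesses of one unit.

    Each item is (operation, reading in base, reading in other), where
    operation is 'replace', 'delete', or 'insert'.
    """
    a = unit.texts[base].split()
    b = unit.texts[other].split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


# Invented readings for two hypothetical Latin witnesses, A and B
unit = AlignmentUnit(
    unit_id="example-1",
    texts={
        "A": "rex debet esse iustus et prudens",
        "B": "rex debet esse iustus et sapiens",
    },
)
print(variants(unit, "A", "B"))  # [('replace', 'prudens', 'sapiens')]
```

A surface comparison like this only makes sense between witnesses in the same language; divergences across languages call for the semantic and manual analysis described in the Methodology section.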

This corpus forms the foundation of the results presented in the Preliminary Results section.

📦 Models

Pretrained models and evaluation outputs will be published here as they become available.

📌 Stay tuned for Hugging Face links and downloadable checkpoints in future releases.

🤝 Contributing to the Project

Contributions are welcome, whether additional data, bug fixes, or enhancements to the analysis scripts. To contribute:

Fork the Repository – Start by forking the repository and cloning it locally.
Create a Branch – Make your changes in a new branch named after the feature or fix.
Submit a Pull Request – After pushing your changes to your fork, open a pull request for discussion and review.

🔗 Related Projects

This repository is part of a broader ecosystem of tools and corpora developed for the study of medieval multilingual textual traditions:

Aquilign
A clause-level multilingual alignment engine based on contextual embeddings (LaBSE), designed specifically for premodern texts.
Multilingual Segmentation Dataset
Source texts and segmented versions in multiple medieval Romance languages, as well as Latin and English, used for training and evaluating clause segmentation models.
Parallelium – an aligned scriptures dataset
A multilingual dataset of aligned Biblical and Qur’anic texts — spanning medieval and modern languages — designed for training and evaluating multilingual alignment models, especially in historical and philological contexts.
Lancelot par maints langages
A parallel corpus of translations of the Lancelot en prose in medieval French, Castilian, and Italian, segmented and aligned using the Aquilign pipeline.

🧾 Talks & Slides

🎤 Congrès International de Linguistique et de Philologie Romanes (CILPR 2025)

Premiers jalons de la collation multilingue du De regimine principum latin et vernaculaire
👥 Matthias Gille Levenson, Lucence Ing, Carolina Macedo
📽️ View presentation slides (PDF)

💰 Funding

This work benefited from national funding managed by the Agence Nationale de la Recherche under the Investissements d'avenir programme with the reference ANR-21-ESRE-0005 (Biblissima+).

⚖️ Licensing

This project is licensed under the CC BY-NC-SA 4.0 license.
This license allows users to adapt, remix, and build upon the work non-commercially, as long as they credit the authors and license their new creations under the same terms.
