Abstract

In this paper, we investigate how the prediction paradigm from machine learning and Natural Language Processing (NLP) can be put to use in computational historical linguistics. We propose word prediction as an intermediate task, in which the forms of unseen words in a target language are predicted from the forms of the corresponding words in a source language. Word prediction allows us to develop algorithms for phylogenetic tree reconstruction, sound correspondence identification and cognate detection, in ways close to attested methods for linguistic reconstruction. We discuss factors, such as data representation and the choice of machine learning model, that must be taken into account when applying prediction methods in historical linguistics. We present our own implementations and evaluate them on different tasks in historical linguistics.

Highlights

  • How are the languages of the world related and how have they evolved? This is the central question in one of the oldest linguistic disciplines: historical linguistics

  • Our work can be seen as part of what has been called the quantitative turn in historical linguistics: computational methods have been applied to automate parts of the workflow of historical linguistics (Jäger and List 2016), which, in part, has become possible due to the increased availability of digital datasets

  • How well suited is each of these encoding styles to the task of word prediction? We evaluate the three input encodings in combination with two machine learning models: the encoder-decoder and the structured perceptron

Summary

Introduction

How are the languages of the world related and how have they evolved? This is the central question in one of the oldest linguistic disciplines: historical linguistics. Different approaches have been applied to cognate detection, the task of detecting ancestrally related words (cognates) in different languages (Inkpen et al 2005; List 2012; Rama 2016; Jäger et al 2017; Dellert 2018), to the inference of sound correspondences (Hruschka et al 2015), to protoform reconstruction (Bouchard-Côté et al 2013) and to phylogenetic tree reconstruction (Jäger 2015; Chang et al 2015). These computational methods have opened up many new research directions and, arguably, provide better replicability than manual methods because of the inherent necessity to specify formal guidelines (Jäger 2019). Examples are Gray and Atkinson (2003), who charted the age of the Indo-European languages, and Bouckaert et al (2012), who proposed locating the Indo-European homeland in Anatolia.
