Abstract
Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges: (1) the scripts are not fully segmented into words; (2) the closest known language is not determined. We propose a decipherment model that handles both of these challenges by building on rich linguistic constraints reflecting consistent patterns in historical sound change. We capture the natural phonological geometry by learning character embeddings based on the International Phonetic Alphabet (IPA). The resulting generative framework jointly models word segmentation and cognate alignment, informed by phonological constraints. We evaluate the model on both deciphered languages (Gothic, Ugaritic) and an undeciphered one (Iberian). The experiments show that incorporating phonetic geometry leads to clear and consistent gains. Additionally, we propose a measure for language closeness which correctly identifies related languages for Gothic and Ugaritic. For Iberian, the method does not show strong evidence supporting Basque as a related language, concurring with the favored position by the current scholarship. 1
Highlights
All the known cases of lost language decipherment have been accomplished by human experts, oftentimes over decades of painstaking efforts
Language proximity directly impacts the complexity of the decipherment tasks which in turn translates into lower model performance on Old English and Old Norse
We propose a decipherment model to extract cognates from undersegmented texts, without assuming proximity between lost and known languages
Summary
All the known cases of lost language decipherment have been accomplished by human experts, oftentimes over decades of painstaking efforts. At least a dozen languages are still undeciphered today For some of those languages, even the most fundamental questions pertaining to their origins and connections to known languages are shrouded in mystery, igniting fierce scientific debate among humanities scholars. Recent work has already demonstrated that algorithms can successfully decipher lost languages like Ugaritic and Linear B (Luo et al, 2019), relying only on non-parallel data in known languages – Hebrew and Ancient Greek, respectively. These methods are based on assumptions that are not applicable to many undeciphered scripts. The second assumption presumes that word boundaries are provided, which uniquely defines the vocabulary of the lost language
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: Transactions of the Association for Computational Linguistics
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.