Abstract

We present CogNet, a large-scale, automatically built database of sense-tagged cognates (words of common origin and meaning across languages). CogNet is continuously evolving: its current version contains over 8 million cognate pairs across 338 languages and 35 writing systems, with new releases already in preparation. The paper presents the algorithm and input resources used to compute it, an evaluation of the result, and a quantitative analysis of cognate data leading to novel insights on language diversity. Furthermore, as an example of how large-scale cross-lingual knowledge bases can improve the quality of multilingual applications, we present a case study on the use of CogNet for bilingual lexicon induction in the framework of cross-lingual transfer learning.

Highlights

  • Cognates are words in different languages that share a common origin and the same meaning, such as the English letter and the French lettre.

  • We used two string similarity methods often applied to cognate identification (St Arnaud et al., 2017): LCS, i.e. the longest common subsequence ratio of two words, and Consonant (Turchin et al., 2010), a heuristic method that checks whether the first three consonants of the words are identical.

  • To cross-check the quality of the output, we randomly sampled 400 cognate pairs not covered by the self-annotated evaluation corpus and had them re-evaluated by the same expert annotators.
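
The two string similarity baselines named in the highlights can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the LCS ratio is normalized by the length of the longer word (a common convention), and that "consonants" are simply Latin letters other than a, e, i, o, u, whereas the published Consonant method works on sound classes. The function names `lcs_ratio` and `consonant_match` are hypothetical.

```python
VOWELS = set("aeiou")  # assumption: a crude, Latin-script-only vowel inventory


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a: str, b: str) -> float:
    """LCS length divided by the length of the longer word (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def first_consonants(word: str, n: int = 3) -> list:
    """First n letters of the word that are not in VOWELS."""
    return [c for c in word.lower() if c.isalpha() and c not in VOWELS][:n]


def consonant_match(a: str, b: str, n: int = 3) -> bool:
    """True if both words have at least one consonant and their first n agree."""
    ca, cb = first_consonants(a, n), first_consonants(b, n)
    return bool(ca) and ca == cb


if __name__ == "__main__":
    print(lcs_ratio("letter", "lettre"))        # similarity in [0, 1]
    print(consonant_match("letter", "lettre"))  # heuristic yes/no decision
```

For the pair letter/lettre from the definition above, the longest common subsequence has five characters out of six, and the first three consonants (l, t, t) coincide, so both baselines would flag the pair as likely cognates. Any decision threshold applied to the LCS ratio would be a further assumption not stated here.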


Summary

Introduction

Cognates are words in different languages that share a common origin and the same meaning, such as the English letter and the French lettre. Popular databases used by cognacy-based methods in historical linguistics, such as ASJP (Jäger, 2018; Wichmann et al., 2010), IELex (Bouckaert et al., 2012), or ABVD (Greenhill et al., 2008), by design have low lexical coverage, typically fewer than a hundred basic concepts per language, but extremely broad language coverage of up to 4000 languages. In these databases, lexical entries that belong to scripts other than Latin or Cyrillic mostly appear in phonetic transcription rather than in the orthography of their original scripts, which limits their use for processing written text.
