Abstract

Can an unknown Amazonian language be identified by statistical procedures based on n-gram frequencies if only a short list of words is available and, at the same time, the available data for the potential candidate languages are also limited to relatively short wordlists? In this paper we show that n-gram frequencies (specifically 1-grams and 2-grams) allow us to identify languages reliably based on as few as 20 words, as long as these are transcribed consistently, and as long as characteristic monogram and bigram frequencies for these languages have previously been established based on consistently transcribed data. If no such consistently transcribed data are available, as is the case in our Amazonian case study, such procedures clearly fail for wordlists with 50 or fewer words. Our study thus contributes to exploring the limits of such automated detection procedures, both in terms of corpus size and transcription quality.

Highlights

  • Automated language identification has been shown to work reliably when large amounts of consistently transcribed data are available, such as internet corpora

  • The point of departure for the current study is a case study of the Amazonian language Carabayo, which is of unknown affiliation and is spoken by a tribe living in voluntary isolation in the Colombian Amazon region (Franco, 2012; Seifart & Echeverri, 2014)

Summary

Introduction

Automated language identification has been shown to work reliably when large amounts of consistently transcribed data are available, such as internet corpora. A short list of Carabayo words was noted down (most of them without translation) from conversations among Carabayos that were overheard in 1969 during a brief encounter with one Carabayo family. During this encounter, it was established that the Carabayo language is mutually unintelligible with any of the living indigenous languages of the area. The quality of transcription for all of this material is poor, i.e. we expect inconsistencies and errors in the graphic representation of sounds, as well as in the segmentation of the phonetic material into words.
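
To make the kind of procedure discussed here concrete, the following is a minimal sketch of wordlist-based identification using character 1-gram and 2-gram frequencies. It is an illustration under stated assumptions, not the method used in the study: the comparison measure (cosine similarity), the pooling of monograms and bigrams into a single profile, the function names, and the toy wordlists are all hypothetical choices made for this example, and the toy words are invented strings rather than data from any real language.

```python
from collections import Counter
from math import sqrt


def ngram_profile(words, n_values=(1, 2)):
    """Relative frequencies of character n-grams, counted within words.

    Monograms and bigrams are pooled into one profile (an assumption of
    this sketch; they could equally be kept and compared separately).
    """
    counts = Counter()
    for word in words:
        for n in n_values:
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def identify(unknown_words, candidates):
    """Return the best-matching candidate name and all similarity scores."""
    unknown = ngram_profile(unknown_words)
    scores = {
        name: cosine_similarity(unknown, ngram_profile(words))
        for name, words in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Invented toy wordlists (placeholders, not real language data).
candidates = {
    "lang_a": ["kata", "taka", "kaka", "tata"],
    "lang_b": ["mulu", "lumu", "mumu", "lulu"],
}
best, scores = identify(["taka", "kata"], candidates)
print(best, scores)
```

Because the profiles are built directly from written characters, any inconsistency in transcription or word segmentation changes the counts, which is why a procedure of this kind can be expected to degrade on poorly transcribed material.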
