Abstract

Any attempt to integrate NLP systems into the study of endangered languages must take into account established approaches from both NLP and linguistics. This paper tests different strategies and workflows for morpheme segmentation and glossing that may affect the potential to integrate machine learning. Two experiments train Transformer models on documentary corpora from five under-documented languages. In the first experiment, one model learns segmentation and glossing as a single joint step, while another learns the two tasks as sequential steps. We find that the sequential approach yields somewhat better results. In the second experiment, one model is trained on surface-segmented data, in which strings of text are simply divided at morpheme boundaries. Another model is trained on canonically segmented data, the approach preferred by linguists, in which abstract, underlying forms are represented. We find no clear advantage to either segmentation strategy and note that the difference between them disappears as training data increases. On average the models achieve an F1-score above 0.5, with the best models scoring 0.6 or higher. An analysis of errors leads us to conclude that consistency during manual segmentation and glossing may yield higher scores in automatic evaluation, but scores may in fact be lowered when predictions are evaluated against the original data, because instances of annotator error in that data are "corrected" by the model.
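To make the evaluation metric concrete, morpheme-level F1 can be computed by comparing a model's predicted morpheme sequence against the gold annotation. The sketch below is illustrative only; the function name, scoring details, and example morphemes are assumptions, not taken from the paper. It also shows the contrast between a surface segmentation (splitting the surface string at boundaries) and a canonical segmentation (representing underlying forms):

```python
from collections import Counter

def morpheme_f1(gold, pred):
    """Hypothetical morpheme-level F1: multiset overlap of morphemes."""
    g, p = Counter(gold), Counter(pred)
    tp = sum((g & p).values())          # true positives: shared morphemes
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative Swahili-like example (not from the paper's corpora):
# surface segmentation simply splits the written form at boundaries,
surface_gold = ["ni", "ki", "penda"]
# while canonical segmentation may expose underlying forms instead.
canonical_gold = ["ni", "ki", "pend", "a"]

# A prediction that merges two morphemes is penalized on both
# precision and recall.
print(morpheme_f1(surface_gold, ["ni", "kipenda"]))  # → 0.4
```

In this sketch an under-segmented prediction loses credit for every boundary it misses, which matches the intuition that segmentation errors compound downstream glossing errors in a sequential pipeline.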
