Abstract
This paper describes the development of the first Universal Dependencies (UD) treebank for St. Lawrence Island Yupik, an endangered language spoken in the Bering Strait region. While the UD guidelines provided a general framework for our annotations, the rich morphology of this polysynthetic language made language-specific decisions necessary. Most notably, we annotated a corpus at the morpheme level as well as the word level. The morpheme-level annotation was conducted using an existing morphological analyzer and manual disambiguation. By comparing the two resulting annotation schemes, we argue that morpheme-level annotation is essential for polysynthetic languages like St. Lawrence Island Yupik. Word-level annotation results in degenerate trees for some Yupik sentences and often fails to capture syntactic relations that can be manifested at the morpheme level. Dependency parsing experiments provide further support for morpheme-level annotation. Implications for UD annotation of other polysynthetic languages are discussed.
Highlights
The Universal Dependencies (UD) project (Nivre et al., 2016, 2020) provides a cross-lingual syntactic dependency annotation scheme for many languages
We describe how we annotated a corpus of St. Lawrence Island Yupik, a polysynthetic language spoken in parts of Alaska and Chukotka, Russia, within the framework of the UD guidelines
While UD is a framework for word-level annotations, we argue that morpheme-level annotations are more meaningful for polysynthetic languages
Summary
The Universal Dependencies (UD) project (Nivre et al., 2016, 2020) provides a cross-lingual syntactic dependency annotation scheme for many languages. The most recent release of the UD treebanks (version 2.7) contains 183 treebanks in 104 languages. Polysynthetic languages, known for words that combine multiple morphemes, remain heavily under-represented in the UD treebanks. Abaza and Chukchi (Tyers and Mishchenkova, 2020) are the only polysynthetic languages included in UD version 2.7. We describe how we annotated a corpus of St. Lawrence Island Yupik (also known as Central Siberian Yupik), a polysynthetic language spoken in parts of Alaska and Chukotka, Russia, within the framework of the UD guidelines. While UD is a framework for word-level annotations, we argue that morpheme-level annotations are more meaningful for polysynthetic languages. We provide morpheme-level annotations for Yupik in addition to word-level annotations. We believe that subword-level annotations can better capture morphosyntactic relations in polysynthetic languages and can support further dependency annotation and morphosyntactic research on these languages.
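The contrast between the two annotation levels can be illustrated with a minimal sketch. It assumes the standard ten-column CoNLL-U layout and uses placeholder forms (`WORD`, `MORPH1`, ...) rather than real Yupik data. The choice of head and the generic `dep` relation are illustrative assumptions, not the annotation scheme used for the treebank, and the helper names `conllu_row`, `parse_conllu`, and `non_root_relations` are hypothetical. The sketch shows only why a sentence consisting of a single word yields a tree with no dependency relations at the word level, while segmenting the same word into morphemes leaves relations to annotate.

```python
def conllu_row(idx, form, head, deprel):
    """Build one CoNLL-U line (hypothetical helper; unused columns are '_')."""
    # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    return "\t".join(
        [str(idx), form, "_", "_", "_", "_", str(head), deprel, "_", "_"]
    )


def parse_conllu(block):
    """Parse a CoNLL-U sentence into (id, form, head, deprel) tuples."""
    rows = []
    for line in block.strip().splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges ("1-2") and empty nodes ("1.1")
        rows.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return rows


def non_root_relations(rows):
    """Return the tokens attached to another token rather than to the root."""
    return [row for row in rows if row[2] != 0]


# Word level: a one-word sentence (placeholder form, not real Yupik).
word_level = conllu_row(1, "WORD", 0, "root")

# Morpheme level: the same word split into three placeholder morphemes.
# Head assignment and the "dep" label are assumptions for illustration only.
morpheme_level = "\n".join(
    [
        conllu_row(1, "MORPH1", 2, "dep"),
        conllu_row(2, "MORPH2", 0, "root"),
        conllu_row(3, "MORPH3", 2, "dep"),
    ]
)

word_relations = non_root_relations(parse_conllu(word_level))
morpheme_relations = non_root_relations(parse_conllu(morpheme_level))

print(len(word_relations))      # 0: the tree consists of the root alone
print(len(morpheme_relations))  # 2: relations become available to annotate
```

Under these assumptions the word-level tree is degenerate in the sense that it carries no syntactic information beyond the root, which is the situation the morpheme-level annotation is meant to address.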