Abstract
We built a pipeline to normalize Quechua texts through morphological analysis and disambiguation. Word forms are analyzed by a set of cascaded finite state transducers which split the words and rewrite the morphemes to a normalized form. However, some of these morphemes, or rather morpheme combinations, are ambiguous, which may affect the normalization. For this reason, we disambiguate the morpheme sequences with conditional random fields. Once we know the individual morphemes of a word, we can generate the normalized word form from the disambiguated morphemes.
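The three stages described above (analysis, disambiguation, generation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the lookup table stands in for the transducer cascade, a hand-weighted scorer stands in for the conditional random field, and the names `analyze`, `toy_disambiguate`, `generate`, `normalize`, the tag labels, and the weights are all hypothetical. The two readings of the suffix -n (third person possessive vs. direct evidential) are an ambiguity picked here for illustration and are not necessarily one the pipeline targets.

```python
from typing import Dict, List, Tuple

Morph = Tuple[str, str]   # (normalized morpheme form, tag)
Analysis = List[Morph]    # one candidate segmentation of a word


def analyze(word: str) -> List[Analysis]:
    """Stand-in for the transducer cascade: return all candidate analyses."""
    toy_analyses: Dict[str, List[Analysis]] = {
        "wasin": [
            [("wasi", "Root"), ("n", "3.Poss")],
            [("wasi", "Root"), ("n", "DirEvid")],
        ],
    }
    return toy_analyses.get(word, [])


def toy_disambiguate(candidates: List[Analysis],
                     weights: Dict[str, float]) -> Analysis:
    """Stand-in for the CRF: pick the analysis with the highest tag score."""
    return max(candidates,
               key=lambda a: sum(weights.get(tag, 0.0) for _, tag in a))


def generate(analysis: Analysis) -> str:
    """Concatenate the normalized morpheme forms into a word form."""
    return "".join(form for form, _ in analysis)


def normalize(word: str, weights: Dict[str, float]) -> str:
    """Analyze, disambiguate, regenerate; unknown words pass through."""
    candidates = analyze(word)
    if not candidates:
        return word
    return generate(toy_disambiguate(candidates, weights))
```

In a real system the scorer would condition on the surrounding morphemes and words rather than on a fixed weight table; the sketch only shows how the stages hand data to each other.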
Highlights
As part of our research project we have developed several tools and resources for Cuzco Quechua
As standardized spelling is an indispensable prerequisite for any statistical processing, we built a pipeline to normalize Quechua texts through morphological analysis and disambiguation
In every pair of transducers, the first one follows a relatively strict orthography, whereas the second one has a set of phonological rules that allow for more variation in the spelling of word forms
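The strict-then-relaxed pairing can be pictured as a fallback cascade: a word is first tried against the strict analyzer, and only if that fails is it passed to a second analyzer that tolerates spelling variation. The sketch below uses plain string matching rather than real finite state transducers, and its tiny lexicon and the two respelling rules (`hua` to `wa`, `cuna` to `kuna`) are assumptions made for illustration, not the rules used in the actual system.

```python
import re
from typing import List, Optional

# Toy lexicon in normalized spelling (hypothetical, for illustration only).
ROOTS = ["wasi"]
SUFFIXES = ["kuna", "pi", "ta"]


def strict_analyze(word: str) -> Optional[List[str]]:
    """Segment a word, accepting only normalized spellings."""
    for root in ROOTS:
        if not word.startswith(root):
            continue
        morphs = [root]
        rest = word[len(root):]
        while rest:
            for suffix in SUFFIXES:
                if rest.startswith(suffix):
                    morphs.append("-" + suffix)
                    rest = rest[len(suffix):]
                    break
            else:
                return None  # leftover material matches no known suffix
        return morphs
    return None


def relaxed_analyze(word: str) -> Optional[List[str]]:
    """Apply toy respelling rules, then reuse the strict analyzer."""
    respelled = re.sub(r"hua", "wa", word)
    respelled = respelled.replace("cuna", "kuna")
    return strict_analyze(respelled)


def analyze(word: str) -> Optional[List[str]]:
    """Try the strict analyzer first and fall back to the relaxed one."""
    for analyzer in (strict_analyze, relaxed_analyze):
        result = analyzer(word)
        if result is not None:
            return result
    return None
```

Trying the strict analyzer first means that words already in normalized spelling never go through the more permissive rules, which keeps the relaxed stage from introducing spurious analyses for them.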
Summary
As part of our research project we have developed several tools and resources for Cuzco Quechua. These include a hybrid Spanish-Quechua machine translation system. An issue that is generally difficult to deal with in a rule-based approach is the lexical choice among translation options: writing context rules for every possible translation of a given input word is not feasible. An alternative is to include a language model, trained on Quechua texts, that can handle the lexical disambiguation. We chose to disambiguate the cases that are relevant for the normalization, but all types of morphological ambiguities