Abstract
The creation of linguistic resources is crucial to the continued growth of research and development efforts in the field of natural language processing, especially for resource-scarce languages. In this paper, we describe the curation and annotation of corpora and the development of multiple linguistic technologies for four official South African languages, namely isiNdebele, Siswati, isiXhosa, and isiZulu. Development efforts included sourcing parallel data for these languages and annotating each on the token, orthographic, morphological, and morphosyntactic levels. These sets were in turn used to create and evaluate three core technologies, viz. a lemmatizer, a part-of-speech tagger, and a morphological analyzer for each of the languages. We report on the quality of these technologies, which improve on the rule-based technologies previously developed as part of a similar initiative in 2013. These resources are made publicly accessible through a local resource agency with the intention of fostering further development of both resources and technologies that may benefit the NLP industry in South Africa.
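As a minimal illustration of the four annotation levels mentioned above, the Python sketch below models a single annotated token. The field names, tag label, lemma convention, and morpheme segmentation are our own illustrative assumptions, not the project's actual annotation schema.

from dataclasses import dataclass

@dataclass
class AnnotatedToken:
    # One token carrying the four annotation levels described above.
    # Field names are hypothetical, not the project's actual schema.
    token: str      # orthographic level: the raw written form
    lemma: str      # lemmatization level
    pos: str        # morphosyntactic level: part-of-speech tag
    morphemes: str  # morphological level: segmented analysis

# Hypothetical isiZulu example: "ngiyabonga" ("I thank you").
example = AnnotatedToken(
    token="ngiyabonga",
    lemma="bonga",              # verb root as lemma (illustrative convention)
    pos="V",                    # illustrative tag label
    morphemes="ngi-ya-bong-a",  # subject concord + present marker + root + final vowel
)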
Highlights
Access to linguistic resources such as annotated data can facilitate, or even hinder, research and development efforts, depending on the quality and availability of those resources
When Eiselen and Puttkammer [5] developed core technologies for 10 of the official South African languages, they concluded that the morphological analysers for disjunctively written languages performed relatively well, while those for conjunctively written languages warranted further research
The results show that language-independent implementations such as Lemming and MarMoT perform better than their rule-based counterparts for the conjunctively written Nguni languages considered: isiNdebele, Siswati, isiXhosa, and isiZulu
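To make the reported comparison concrete, the following is a minimal sketch of how token-level accuracy, the usual metric for scoring a tagger or lemmatizer against gold-standard annotations, can be computed. The function name and sample tags are illustrative assumptions, not taken from the paper.

from typing import Iterable

def token_accuracy(gold: Iterable[str], predicted: Iterable[str]) -> float:
    # Share of tokens whose predicted label (lemma, POS tag, or
    # morphological analysis) exactly matches the gold label.
    pairs = list(zip(gold, predicted))
    if not pairs:
        raise ValueError("no tokens to evaluate")
    return sum(g == p for g, p in pairs) / len(pairs)

# Hypothetical POS-tagging evaluation on a four-token sample.
gold_tags = ["V", "N", "PRON", "V"]
pred_tags = ["V", "N", "ADV", "V"]
print(f"POS accuracy: {token_accuracy(gold_tags, pred_tags):.2f}")  # 0.75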
Summary
Access to linguistic resources such as annotated data can facilitate, or even hinder, research and development efforts, depending on the quality and availability of those resources. Central to these efforts is the notion of lexical semantics, generally defined as the analysis of words and lexical units in terms of their classification, decomposition, and lexical meaning in relation to context and syntax. Contemporary research relies on natural language processing (NLP) to investigate usage patterns within large electronic corpora in order to perform lexical semantic tasks such as word sense disambiguation and semantic role labelling [2]. NLP applications rely on these tasks for machine translation, information extraction, and text classification, among other tasks. For under-resourced languages, this approach suffers due to the scarcity and often poor quality of available lexical data [3].