Abstract

Part-of-Speech (PoS) tagging is a corpus linguistics task that assigns the appropriate lexical category (noun, verb, adjective, adverb, etc.) to each word in a sentence. To address the challenges associated with PoS tagging, several Natural Language Processing (NLP) modelling techniques have been employed across diverse languages, including Conditional Random Fields (CRF), Support Vector Machines (SVM), and Decision Trees. However, creating language resources is an expensive process for many languages, including the indigenous languages of South Africa, which are classified as resource-scarce. Therefore, using Setswana as an example of a language with limited resources, this study explores and applies methods to increase the utilization of existing resources and to improve tagger accuracy. This is done using two Setswana PoS taggers, a Maximum Entropy (MaxEnt) tagger and an SVM tagger, which achieved accuracies of 94.4 per cent and 95.59 per cent respectively. An error analysis was carried out to identify the errors made by each tagger. A combined Setswana PoS tagger was then built using a voting algorithm, which improved the results and attained 97.06 per cent accuracy. The combination of taggers reduces the error rate by up to 2.01 per cent.
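
The abstract does not describe the voting scheme in detail, so the following is only a minimal sketch of one common way to combine taggers: precision-weighted voting, in which each tagger's vote for a tag counts in proportion to how reliable that tagger is for that tag. The function name `weighted_vote`, the tag labels, and all precision values are hypothetical and chosen purely for illustration; they are not taken from the study.

```python
from collections import defaultdict


def weighted_vote(predictions, tag_precision, default=0.5):
    """Combine per-token tags from several taggers by weighted voting.

    predictions:   dict mapping tagger name -> list of tags, one per token
    tag_precision: dict mapping tagger name -> {tag: weight}; the weights
                   are assumed to come from a held-out evaluation set
    default:       weight used when a tagger has no entry for a tag
    """
    lengths = {len(tags) for tags in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all taggers must tag the same number of tokens")

    combined = []
    for i in range(lengths.pop()):
        scores = defaultdict(float)
        for name, tags in predictions.items():
            tag = tags[i]
            # Each tagger adds its weight for the tag it proposes.
            scores[tag] += tag_precision[name].get(tag, default)
        # Keep the tag with the highest total weight for this token.
        combined.append(max(scores, key=scores.get))
    return combined


# Hypothetical outputs of two taggers for a three-token sentence.
predictions = {
    "maxent": ["N", "V", "ADJ"],
    "svm": ["N", "N", "ADV"],
}
# Hypothetical per-tag precision values (illustrative only).
tag_precision = {
    "maxent": {"N": 0.95, "V": 0.97, "ADJ": 0.90},
    "svm": {"N": 0.93, "V": 0.96, "ADV": 0.94},
}
combined = weighted_vote(predictions, tag_precision)
print(combined)
```

Per-tag weights are used here because, with only two taggers and a single global weight each, voting would simply reproduce the output of the stronger tagger. Per-tag weights allow the combination to follow whichever tagger is more reliable for the tag it proposes, which is one way a combined tagger can outperform both of its components.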
