A Hybrid Part of Speech Tagger for Setswana Language using a Voting Method

Mary Dibitso, Pius A Owolawi, Sunday O Ojo

doi:10.59200/iconic.2022.027

Abstract

Part-of-Speech (PoS) tagging is a corpus linguistics task that assigns the appropriate lexical category (noun, verb, adjective, adverb, etc.) to each word in a sentence. To address the challenges associated with PoS tagging, several modelling techniques from Natural Language Processing (NLP) have been employed across diverse languages, including Conditional Random Fields (CRF), Support Vector Machines (SVM), and Decision Trees. However, creating language resources is expensive for many languages, including the indigenous languages of South Africa, which are classified as resource-scarce. Using Setswana as an example of a language with limited resources, this study therefore explores and applies methods to make better use of existing resources and to improve tagger accuracy. It does so using two Setswana PoS taggers, a Maximum Entropy (MaxEnt) tagger and an SVM tagger, which achieved accuracies of 94.4 per cent and 95.59 per cent respectively. An error analysis is carried out to identify the errors made by the taggers. A combined Setswana PoS tagger was then built using a voting algorithm, improving on both individual taggers and attaining 97.06 per cent accuracy. The combination of taggers reduces the error rate by up to 2.01 per cent.
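The abstract does not state which voting rule is used, so the following is only a minimal sketch of how the outputs of two taggers could be combined token by token. It assumes a weighted vote in which each tagger's vote for a tag counts as much as a per-tag reliability score (for example, precision measured on held-out data). The function name `vote`, the tag labels, and all weights are hypothetical illustrations, not values from the study.

```python
from collections import defaultdict


def vote(predictions, weights, default_weight=0.0):
    """Combine per-token tags from several taggers by weighted voting.

    predictions: dict mapping tagger name -> list of tags, one per token.
    weights: dict mapping tagger name -> dict of tag -> weight
             (assumed here to be a per-tag reliability score).
    Ties go to the tag proposed by the tagger listed first.
    """
    names = list(predictions)
    length = len(predictions[names[0]])
    if any(len(predictions[name]) != length for name in names):
        raise ValueError("all taggers must tag the same number of tokens")

    combined = []
    for i in range(length):
        scores = defaultdict(float)
        for name in names:
            tag = predictions[name][i]
            # Each tagger adds its weight for the tag it proposed.
            scores[tag] += weights[name].get(tag, default_weight)
        # Pick the tag with the highest accumulated weight.
        combined.append(max(scores, key=scores.get))
    return combined


# Hypothetical outputs for a three-token sentence.
predictions = {
    "maxent": ["N", "V", "N"],
    "svm": ["N", "N", "ADJ"],
}
# Hypothetical per-tag weights, not figures from the paper.
weights = {
    "maxent": {"N": 0.90, "V": 0.95},
    "svm": {"N": 0.80, "ADJ": 0.97},
}
combined = vote(predictions, weights)
print(combined)
```

Where the two taggers agree, the shared tag is kept; where they disagree, the tag backed by the higher weight wins. A plain majority vote cannot break a disagreement between only two taggers, which is why this sketch assumes weights.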

Full Text