Context-Driven Corpus-Based Model for Automatic Text Segmentation and Part of Speech Tagging in Setswana Using OpenNLP Tool

Mary Ambrossine Dibitso, Sunday Olusegun Ojo, Pius Adewale Owolawi

doi:10.1007/978-3-030-34974-5_6

Abstract

Setswana is an under-resourced, morphologically rich Bantu language of Africa with a disjunctive writing system. Developing NLP pipeline tools for such a language can be challenging, because the linguistic and semantic robustness of each tool must be balanced against computational parsimony. A Part-of-Speech (POS) tagger is one such tool: it assigns a lexical category, such as noun, verb, or pronoun, to each word in a text corpus. POS tagging is an important task in Natural Language Processing (NLP) applications such as information extraction, machine translation, and word prediction. Developing a POS tagger for a morphologically rich language such as Setswana poses computational-linguistic challenges that can affect the effectiveness of the entire NLP system. These challenges arise from contextual semantic features of the language that demand a fine-grained POS tagset, which makes the trade-off between semantic robustness and computational parsimony harder to strike. This paper presents a context-driven, corpus-based model for text segmentation and POS tagging in Setswana. The tagger is developed using the Apache OpenNLP toolkit and achieves an accuracy of 96.73%.

Full Text