Automatic lemmatization in Setswana: towards a prototype

Karien Brits, Rigardt Pretorius, Gerhard B Van Huyssteen

doi:10.1080/02572117.2005.10587247

Abstract

Development of human language technologies for the indigenous South African languages is currently being undertaken in various projects across South Africa. In one such project a lemmatizer for Setswana is being developed, and this article reports on work towards the development of a first prototype. A prerequisite of lemmatization is to determine what the output of a lemmatizer for a specific language should be (i.e. what should be considered a lemma in that language). Consequently, the concept of a lemma as it should be understood in the context of Setswana lemmatization is defined, and it is indicated that only nouns and verbs pose real challenges for the lemmatization of Setswana. The computational approach taken in this research and its implementation, which uses FSA 6, are described at length. Preliminary results from a small, contained experiment indicate that the rules for nouns and verbs are fairly accurate, with precision scores of 93–94%. The article concludes with a discussion of future work.

Full Text