Abstract

Development of human language technologies for the indigenous South African languages is currently being undertaken in various projects across South Africa. In one such project, a lemmatizer for Setswana is being developed, and this article reports on work towards the development of a first prototype. A prerequisite of lemmatization is to determine what the output of a lemmatizer for a specific language should be (i.e. what should be considered a lemma in that language). Consequently, the concept of a lemma as it should be understood in the context of Setswana lemmatization is defined, and it is indicated that only nouns and verbs really pose challenges for the lemmatization of Setswana. The computational approach taken in this research and the implementation applied, both of which use FSA 6, are described at length. Preliminary results indicate that the rules for nouns and verbs are rather accurate, with precision scores of 93–94% obtained in a small, contained experiment. The article concludes with a discussion of future work.
