Abstract

Working with corpora in the South African Bantu languages has until now been limited to the utilisation of raw corpora. Such corpora, however, have limited functionality. Thus the next logical step in any NLP application is the development of software for automatic tagging of electronic texts. The development of a tagset is one of the first steps in corpus annotation. The authors of this article argue that the design of a tagset cannot be isolated from the purpose of the tagset, or from the place of the tagset and its design within the bigger picture of the architecture of corpus annotation. Usage-related aspects therefore feature prominently in the design of the tagset for Northern Sotho. It is explained why this proposed tagset is biased towards human readability, rather than machine readability; the choice of a stochastic tagger is motivated, and the relationship between tokenising, tagging, morphological analysis and parsing is discussed. In order to account at least to some extent for the morphological complexity of Northern Sotho at the tagging level, a multilevel annotation is opted for: the first level comprising obligatory information and the second optional and recommended information. Finally, aspects of standardisation are considered against the background of reuse, of sharing of resources, and of possible adaptation for use by other disjunctively written South African Bantu languages. It is not the aim of this article to evaluate the results of any tagging procedure using the proposed tagset. It only describes the design and motivates the choices made with regard to the tagset design. However, an evaluation is in progress and results will be published in the near future (cf. Faaß et al., s.a.).

Highlights

  • South Africa is a relative newcomer to the field of HLT, and the pool of expertise and skills in this regard is still rather small

  • This is important with regard to the development of software tools to be used in NLP applications; it is true of African language resources

  • The class membership feature: The classes 01-15 and the locative classes 16-18 (LOC) are all assigned at the first level of annotation, except for the so-called copulative subject concords, which actually function as full copulative verbs and are tagged as such, and the question words (QUE); see below
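
To make the idea of class membership at the first annotation level more concrete, the following is a minimal sketch and not the authors' actual tag format. It assumes that classes 1 to 15 are written as two-digit labels, that the locative classes 16 to 18 share the label LOC, and that optional second-level information is appended after an underscore. The function names, the category labels "N" and "V", the underscore separator and the second-level value "tr" are all hypothetical.

```python
from typing import Optional


def class_label(noun_class: int) -> str:
    """Map a class number to a first-level label (assumed convention)."""
    if 1 <= noun_class <= 15:
        return f"{noun_class:02d}"      # classes 01-15 keep their number
    if 16 <= noun_class <= 18:
        return "LOC"                    # locative classes share one label
    raise ValueError(f"unknown class: {noun_class}")


def make_tag(category: str,
             noun_class: Optional[int] = None,
             second_level: Optional[str] = None) -> str:
    """Build a two-level tag: obligatory first level, optional second level."""
    tag = category
    if noun_class is not None:
        tag += class_label(noun_class)  # first level: class membership
    if second_level:
        tag += "_" + second_level       # second level: hypothetical separator
    return tag


if __name__ == "__main__":
    print(make_tag("N", 1))
    print(make_tag("N", 17))
    print(make_tag("V", second_level="tr"))
```

Keeping the second level as an optional suffix means a tagger or a human reader can ignore it and still have a complete first-level tag, which matches the obligatory versus optional split described in the abstract.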


Summary

Context and objectives

South Africa is a relative newcomer to the field of HLT, and the pool of expertise and skills in this regard is still rather small. It is of the utmost importance that existing expertise is utilised in the most effective manner. This is important with regard to the development of software tools to be used in NLP applications; it is true of African language resources. The University of Pretoria Sepedi Corpus (PSC), one of these organic corpora, currently stands at about 6,2 million words. Since their initial conception, these corpora have been used in many different applications, e.g. the compilation of wordlists used for the building of spellcheckers, linguistic and terminological research, and translation studies. The next logical step would be the annotation of these corpora in order to increase their (re)usability and multi-functionality (cf. Snyman et al., 2007).

On standardisation
Design of a tagset for Northern Sotho
Principles of tagset design
Usage-related aspects
Tagset design within the context of corpus annotation
Linguistic and formal aspects of the design of a Northern Sotho tagset
Current version of the POS tagset for Northern Sotho
More issues in the application of the tagset
Multi-unit tokens
Yet undefined forms
Conclusions and future work
