On the development of a tagset for Northern Sotho with special reference to the issue of standardisation

E Taljard, U Heid, D.J. Prinsloo, G Faaß

doi:10.4102/lit.v29i1.103

Abstract

Working with corpora in the South African Bantu languages has until now been limited to the utilisation of raw corpora. Such corpora, however, have limited functionality. Thus the next logical step in any NLP application is the development of software for automatic tagging of electronic texts. The development of a tagset is one of the first steps in corpus annotation. The authors of this article argue that the design of a tagset cannot be isolated from the purpose of the tagset, or from the place of the tagset and its design within the bigger picture of the architecture of corpus annotation. Usage-related aspects therefore feature prominently in the design of the tagset for Northern Sotho. It is explained why this proposed tagset is biased towards human readability, rather than machine readability; the choice of a stochastic tagger is motivated, and the relationship between tokenising, tagging, morphological analysis and parsing is discussed. In order to account at least to some extent for the morphological complexity of Northern Sotho at the tagging level, a multilevel annotation is opted for: the first level comprising obligatory information and the second optional and recommended information. Finally, aspects of standardisation are considered against the background of reuse, of sharing of resources, and of possible adaptation for use by other disjunctively written South African Bantu languages. It is not the aim of this article to evaluate the results of any tagging procedure using the proposed tagset. It only describes the design and motivates the choices made with regard to the tagset design. However, an evaluation is in progress and results will be published in the near future (cf. Faaß et al., s.a.).

Highlights

South Africa is a relative newcomer to the field of HLT, and the pool of expertise and skills in this regard is still rather small.
This is important with regard to the development of software tools to be used in NLP applications; it is true of African language resources.
The class membership feature: the classes 01-15 and the locative classes 16-18 (LOC) are all assigned at the first level of annotation, except the so-called copulative subject concords, which actually function as full copulative verbs and are tagged as such, and the question words (QUE); see below.

Summary

Context and objectives

South Africa is a relative newcomer to the field of HLT, and the pool of expertise and skills in this regard is still rather small. It is of the utmost importance that existing expertise is utilised in the most effective manner. This is important with regard to the development of software tools to be used in NLP applications; it is true of African language resources. The University of Pretoria Sepedi Corpus (PSC), one of these organic corpora, currently stands at about 6,2 million words. Since their initial conception, these corpora have been used in many different applications, e.g. the compilation of wordlists used for the building of spellcheckers, linguistic and terminological research, and translation studies. The next logical step would be the annotation of these corpora in order to increase their (re)usability and multi-functionality (cf. Snyman et al., 2007).

On standardisation

Design of a tagset for Northern Sotho

Principles of tagset design

Usage-related aspects

Tagset design within the context of corpus annotation

Linguistic and formal aspects of the design of a Northern Sotho tagset

Current version of the POS tagset for Northern Sotho

More issues in the application of the tagset

Multi-unit tokens

Yet undefined forms

Conclusions and future work

Journal: Literator	Publication Date: Jul 25, 2008
Citations: 8	License type: CC BY 4.0
