Abstract

This paper illustrates the uses of a tagged pilot corpus of spoken Cameroon Pidgin English (CPE), which has recently been finalised (Ozon et al. 2017) and made available online (Green et al. 2016). The corpus consists of 240,000 words, with mark-up and part-of-speech tagging. The text categories and the proportions of monologue and dialogue are in line with those of the ICE project (Nelson 1996), making the CPE corpus directly comparable with existing corpora of post-colonial Englishes. The project necessitated the development of a dedicated tagset for CPE, which was used to tag the corpus automatically with TreeTagger (Schmid 1994), achieving 94% accuracy. This tagged corpus offers an invaluable resource for the investigation of CPE, and is particularly useful for automatic retrieval of language phenomena above the level of the lexicon, for which a substantially larger corpus is required. The tagging in particular is instrumental in addressing issues of multifunctionality characteristic of pidgin/creole languages. For example, certain verbs (e.g. goe ‘go’, kam ‘come’, gif ‘give’ and teik ‘take’) can function independently as lexical verbs and can also participate in serial verb constructions (SVCs) in CPE. The tagged corpus distinguishes between these different uses, allowing each to be retrieved automatically with a simple search. We introduce the dataset and present some case studies illustrating its potential uses, in order to highlight the usefulness of such freely accessible resources for research on African languages.
