Corpus-based Lexicography for Lesser-resourced Languages — Maximizing the Limited Corpus

D.J. Prinsloo

doi:10.5788/25-1-1300

Journal: Lexikos
Publication Date: Nov 1, 2015
Citations: 6
License type: cc-by

Affiliation: University of Pretoria

Abstract

This article focuses on lesser-resourced languages for which only very limited corpora are available, and on how such relatively small and often unbalanced, raw corpora could be maximally utilized for lexicographic purposes to obtain results similar to those obtained from large corpora. Sepedi and Afrikaans will be studied in this regard. The aim is to determine to what extent enlarging a corpus, for example from one million to 10 million words and from 10 million to 100 million words, enhances its potential for (a) macrostructure compilation, (b) sourcing information on the most important microstructural aspects and (c) the creation of lexicographic tools. It will be argued that valuable and even sufficient data for the compilation of a specific dictionary can be extracted from a relatively small corpus of approximately one million words, but that bigger in some instances indeed means better.
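One simple way to probe the question of corpus size and macrostructure compilation is to check how much of a frequency-based lemma-sign candidate list drawn from a larger corpus is already present in the list drawn from a smaller portion of it. The Python sketch below illustrates that idea only; it is not the procedure used in the article. The function names (`tokenize`, `top_n_types`, `overlap_by_size`), the naive tokenizer, and the use of raw word forms instead of lemmas are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens.

    Assumption: a naive regex tokenizer (letters, with internal
    hyphens or apostrophes); a real project would use a tokenizer
    suited to the language in question.
    """
    return re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text.lower())


def top_n_types(tokens, n):
    """Return the set of the n most frequent word types."""
    return {word for word, _ in Counter(tokens).most_common(n)}


def overlap_by_size(tokens, sizes, n):
    """For each sub-corpus size, return the share of the full corpus's
    top-n types that already appear in the top-n of the first `size` tokens.

    Assumption: the sub-corpus is simply a prefix of the token list;
    random sampling across texts would be a fairer comparison.
    """
    reference = top_n_types(tokens, n)
    return {
        size: len(top_n_types(tokens[:size], n) & reference) / len(reference)
        for size in sizes
    }
```

Called with a tokenized corpus, a list of sizes such as one million and 10 million tokens, and a candidate-list length `n`, `overlap_by_size` returns one proportion per size. A proportion close to 1.0 for the smallest size would suggest that the smaller corpus already yields most of the high-frequency candidates, although it says nothing about coverage of rarer words.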

Highlights

The days of a default corpus size of one million words, such as the groundbreaking first computer-readable general text corpus, the Brown Corpus of Standard American English [...] (Lexikos 25 (AFRILEX-reeks/series 25: 2015): 285-300)
Corpora for major languages typically run into hundreds of millions and even billions of words, for example Google Books with 155 billion words for American English, 45 billion for Spanish and 34 billion for British English, and are typically referred to as "big corpora".
This article focuses on lesser-resourced languages for which only very limited corpora are available, and on how such relatively small and often unbalanced, raw corpora could be maximally utilized for lexicographic purposes to obtain similar results in the absence of large corpora.

Summary

Introduction

The days of a default corpus size of one million words, such as the groundbreaking first computer-readable general text corpus, the Brown Corpus of Standard American English [...] It will be argued that valuable and even sufficient data for the compilation of a specific dictionary can be extracted from a relatively small corpus of approximately one million words.

Results

Conclusion