Abstract

This paper describes the resources and software procedures used or developed in a major enabling step towards the revision of the scholarly reference work A Dictionary of South African English on Historical Principles (DSAE, Silva et al. 1996), namely the semi-automatic generation of a digitally sourced lexical database on which new and updated dictionary entries will be based, as well as the addition, in parallel, of a new corpus of South African English (SAE) to the project. Drawing on online data sources and an extensive list of known SAE word forms, we have developed a software toolchain to gather, encode, annotate and collate textual sources, producing: (i) a 3.1-billion-word part-of-speech-annotated corpus of South African English; (ii) a lexical database of illustrative quotations for over 20,000 known SAE word forms, available for selection at the entry-revision stage; and (iii) a list of potential new variant spellings and headword inclusion candidates. Where recent electronic sources are concerned, these steps replace the mechanical aspects of quotation gathering, which is normally undertaken manually through a reading programme requiring years of teamwork to acquire sufficient coverage (cf. Hicks 2010).
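The collation step described above, in which known SAE word forms are matched against gathered text to yield candidate quotations, can be illustrated with a minimal sketch. It is not the project's toolchain: the function name collate_quotations, the max_per_form parameter, the crude regular-expression tokenisation and the sample sentences are all assumptions made for illustration, and only single-word forms are handled.

```python
import re
from collections import defaultdict


def collate_quotations(sentences, word_forms, max_per_form=3):
    """Map each known word form to the sentences in which it occurs.

    Hypothetical helper: a simplified stand-in for quotation collation,
    not the procedure used in the paper.
    """
    wanted = {w.lower() for w in word_forms}
    quotations = defaultdict(list)
    for sentence in sentences:
        # Crude tokenisation: runs of letters, apostrophes and hyphens.
        tokens = {t.lower() for t in re.findall(r"[A-Za-z'-]+", sentence)}
        # Keep the sentence as a candidate quotation for each form it contains.
        for form in sorted(tokens & wanted):
            if len(quotations[form]) < max_per_form:
                quotations[form].append(sentence)
    return dict(quotations)


# Made-up sample sentences and a tiny search list, for illustration only.
sample_sentences = [
    "We had a braai on Saturday.",
    "The bakkie broke down outside town.",
    "Bring wood for the braai.",
]
result = collate_quotations(sample_sentences, ["braai", "bakkie", "lekker"])
```

Forms with no match, such as "lekker" in this sample, are simply absent from the result. A real pipeline would also need source metadata and dates for each quotation, which this sketch omits.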

Highlights

  • Summary: The semi-automation of the reading programme of a historical dictionary project

  • A Dictionary of South African English on Historical Principles (DSAE, Silva et al. 1996) is a diachronic variety dictionary, first published as a single-volume print dictionary spanning about 800 pages and available as a pilot online edition at http://dsae.co.za since 2014

  • Much of the DSAE's compilation process was directed towards an ongoing reading programme


Summary

Role of quotations in the dictionary

A Dictionary of South African English on Historical Principles (DSAE, Silva et al. 1996) is a diachronic variety dictionary, first published as a single-volume print dictionary spanning about 800 pages and available as a pilot online edition at http://dsae.co.za since 2014. With the help of numerous volunteer readers, approximately 300,000 index card citations were collected as illustrative evidence for dictionary entries, their sense-divisions as they evolve through time, and nested lemmas. Of these, about 45,000 quotations were included in the printed version of the dictionary, resulting in an average of 10 quotations per entry and producing a full running text of about 1.5 million words.

The need for new quotations
Typical quotation-gathering stages
Input data sources
Newspaper Corpus
Web Corpus
Annotated corpus and corpus query system
General overview
Input: SAE dictionary search list
Analysis of new headword candidates unrecognised by the TreeTagger
Detection of new variants based on word similarity
Detection of new headword candidates based on word similarity
Detection of headword candidates using term extraction
Re-orientation of reading programme prompted by semi-automation
Conclusion
Findings
References