Abstract

This article presents various approaches used in corpus-based computational lexicography. A claim is made that in order for computational lexicography to be efficient, precise and comprehensive, it should use the method where the corpus text is first analysed, and the results of this analysis are then processed further to meet the needs of a dictionary. This method has several advantages, including high precision and recall, as well as the possibility to automate the process much further than with more traditional computational methods. The frequency list obtained by using the lemma (the equivalent of the headword) as the basis helps in selecting the words to be included in the dictionary. The approach is demonstrated phase by phase by applying SALAMA (the Swahili Language Manager) to the process. Manual work will be needed in the phase when examples of use are selected from the corpus, and possibly modified. However, the list of examples of use, arranged alphabetically according to the corresponding headword, can also be produced automatically. Thus the alphabetical list of headwords with examples of use is the material on which the lexicographer works manually. The article deals with problems encountered in compiling traditional printed dictionaries, and it excludes electronic dictionaries and thesauri.

Keywords: lexicography, dictionary, language technology, computational linguistics, automatic compilation, dictionary testing, information retrieval, morphological analysis, semantic analysis, disambiguation, heuristics
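As a minimal sketch of the lemma-based frequency list described above: the two-column "wordform, lemma" input format and the function name `lemma_frequencies` are assumptions for illustration, not SALAMA's actual output format.

```python
from collections import Counter

def lemma_frequencies(lines):
    """Count corpus frequencies by lemma rather than surface form.

    Assumes each line of the analysed corpus holds "wordform<TAB>lemma",
    i.e. the output of a morphological analyser (hypothetical format).
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2:
            counts[fields[1]] += 1  # count the lemma, not the inflected form
    return counts

# Toy analysed corpus: "vitabu" and "kitabu" share the lemma "kitabu"
analysed = ["kamusi\tkamusi", "vitabu\tkitabu", "kitabu\tkitabu"]
for lemma, freq in lemma_frequencies(analysed).most_common():
    print(lemma, freq)  # candidate headwords, most frequent first
```

Sorting by frequency in this way gives the lexicographer an ordered list of candidate headwords drawn from the analysed corpus.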

Highlights

  • The use of computers in lexicographical work has gone through various phases, where enthusiasm on the one hand and disappointment on the other have alternated

  • The results show that the monolingual dictionary Kamusi ya Kiswahili Sanifu (KKS) was able to recognize between 89.7 and 91.8% of the words of the three corpora, and Kamusi ya Kiswahili–Kiingereza (KKK) recognized 90.7 to 92.9% of the words

  • After a fairly long period of research and testing, computational lexicography has reached a stage where computers and corpora can be put into effective use


Summary

Introduction

The use of computers in lexicographical work has gone through various phases, where enthusiasm on the one hand and disappointment on the other have alternated. Automatic concordancing was a huge improvement compared with manual compilation, but there was nothing linguistically intelligent in it. These retrieval programs, often called KWIC (Key Word In Context), continue to be standard tools in dictionary work, but they are suitable only for selected tasks. In order for computer-based lexicographical work to be really meaningful, the computer system used for the work has to acquire and make explicit the linguistic information attached to each of the potential lexemes in the dictionary. The computer system designed for lexicographical work should be able to address each of these problems and solve them. This calls for a full computational description of a language, a description that in great detail makes use of linguistic rules and is lexically comprehensive. Work on the computer description of Swahili started in 1985, and has reached a phase where almost all the problems have at least been addressed, and most of them solved. The system will be briefly described phase by phase, and by means of examples it will be shown how the system can be applied to dictionary compilation.
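A KWIC concordance of the kind mentioned above can be sketched in a few lines of Python. The function name `kwic` and the window width are hypothetical; the point is that such a program retrieves raw string matches with surrounding context and applies no linguistic analysis at all.

```python
import re

def kwic(text, keyword, width=30):
    """Print every occurrence of keyword with `width` characters of
    context on each side: plain string matching, with no linguistic
    intelligence, in the manner of early concordancers."""
    for match in re.finditer(re.escape(keyword), text, re.IGNORECASE):
        start = max(match.start() - width, 0)
        left = text[start:match.start()].rjust(width)
        right = text[match.end():match.end() + width].ljust(width)
        print(f"{left} [{match.group()}] {right}")

# Toy example: every hit of "kamusi" (Swahili for 'dictionary')
kwic("Kamusi ya Kiswahili Sanifu ni kamusi ya kwanza.", "kamusi")
```

Because the search works on surface strings only, such a tool cannot group inflected forms under one lemma or distinguish homographs, which is why the article argues for analysing the corpus first.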

Choice of headwords
Format of the corpus
Direct string search — traditional approach
String search with regular expressions
Advanced approach — analyse text first
The problem of ambiguity
Removing excessive tags
Post-processing of the analysed corpus
Findings
Conclusion