Abstract

This article presents various approaches used in corpus-based computational lexicography. A claim is made that in order for computational lexicography to be efficient, precise and comprehensive, it should use the method where the corpus text is first analysed, and the results of this analysis are then processed further to meet the needs of a dictionary. This method has several advantages, including high precision and recall, as well as the possibility to automate the process much further than with more traditional computational methods. The frequency list obtained by using the lemma (the equivalent of the headword) as the basis helps in selecting the words to be included in the dictionary. The approach is demonstrated phase by phase by applying SALAMA (the Swahili Language Manager) to the process. Manual work will be needed in the phase when examples of use are selected from the corpus, and possibly modified. However, the list of examples of use, arranged alphabetically according to the corresponding headword, can also be produced automatically. Thus the alphabetical list of headwords with examples of use is the material on which the lexicographer works manually. The article deals with problems encountered in compiling traditional printed dictionaries, and it excludes electronic dictionaries and thesauri.

Keywords: lexicography, dictionary, language technology, computational linguistics, automatic compilation, dictionary testing, information retrieval, morphological analysis, semantic analysis, disambiguation, heuristics
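As a minimal sketch of the lemma-based frequency list described above: the two-column "wordform, lemma" input format and the function name `lemma_frequencies` are assumptions for illustration, not SALAMA's actual output format.

```python
from collections import Counter

def lemma_frequencies(lines):
    """Count corpus frequencies by lemma rather than surface form.

    Assumes each line of the analysed corpus holds "wordform<TAB>lemma",
    i.e. the output of a morphological analyser (hypothetical format).
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2:
            counts[fields[1]] += 1  # count the lemma, not the inflected form
    return counts

# Toy analysed corpus: "vitabu" and "kitabu" share the lemma "kitabu"
analysed = ["kamusi\tkamusi", "vitabu\tkitabu", "kitabu\tkitabu"]
for lemma, freq in lemma_frequencies(analysed).most_common():
    print(lemma, freq)  # candidate headwords, most frequent first
```

Sorting by frequency in this way gives the lexicographer an ordered list of candidate headwords drawn from the analysed corpus.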

Highlights

  • The use of computers in lexicographical work has gone through various phases, where enthusiasm on the one hand and disappointment on the other have alternated

  • The results show that the monolingual dictionary Kamusi ya Kiswahili Sanifu (KKS) was able to recognize between 89.7 and 91.8% of the words of the three corpora, and Kamusi ya Kiswahili–Kiingereza (KKK) recognized 90.7 to 92.9% of the words

  • After a fairly long period of research and testing, computational lexicography has reached a stage where computers and corpora can be put into effective use


Summary

Introduction

The use of computers in lexicographical work has gone through various phases, where enthusiasm on the one hand and disappointment on the other have alternated. Automatic concordancing was a huge improvement compared with manual compilation, but there was nothing linguistically intelligent in it. These retrieval programs, often called KWIC (Key Word In Context), continue to be standard tools in dictionary work, but they are suitable only for selected tasks. In order for computer-based lexicographical work to be really meaningful, the computer system used for the work has to acquire and make explicit the linguistic information attached to each of the potential lexemes in the dictionary. The computer system designed for lexicographical work should be able to address each of these problems and solve them. This calls for a full computational description of a language, a description that in great detail makes use of linguistic rules and is lexically comprehensive. Work on the computer description of Swahili started in 1985, and has reached a phase where almost all the problems have at least been addressed, and most of them solved. The system will be briefly described phase by phase, and by means of examples it will be shown how the system can be applied to dictionary compilation.
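A KWIC concordance of the kind mentioned above can be sketched in a few lines of Python. The function name `kwic` and the window width are hypothetical; the point is that such a program retrieves raw string matches with surrounding context and applies no linguistic analysis at all.

```python
import re

def kwic(text, keyword, width=30):
    """Print every occurrence of keyword with `width` characters of
    context on each side: plain string matching, with no linguistic
    intelligence, in the manner of early concordancers."""
    for match in re.finditer(re.escape(keyword), text, re.IGNORECASE):
        start = max(match.start() - width, 0)
        left = text[start:match.start()].rjust(width)
        right = text[match.end():match.end() + width].ljust(width)
        print(f"{left} [{match.group()}] {right}")

# Toy example: every hit of "kamusi" (Swahili for 'dictionary')
kwic("Kamusi ya Kiswahili Sanifu ni kamusi ya kwanza.", "kamusi")
```

Because the search works on surface strings only, such a tool cannot group inflected forms under one lemma or distinguish homographs, which is why the article argues for analysing the corpus first.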

Choice of headwords
Format of the corpus
Direct string search — traditional approach
String search with regular expressions
Advanced approach — analyse text first
The problem of ambiguity
Removing excessive tags
Post-processing of the analysed corpus
Findings
Conclusion