Abstract

The use of text corpora has increased considerably in the past few years, not only in the field of lexicography but also in computational linguistics and language technology. Consequently, corpus data and expertise developed by lexicographical institutions have gained a broader scope of application. In the European context this has led to a revised view of corpus design. In line with these developments, the Institute for Dutch Lexicology (INL) has since 1994 been providing external access to steadily improving corpora via the Internet. In August 1996, the 38 Million Words Corpus was made available for consultation by the international research community. The present paper reports on the characteristics of this corpus (design, text classification, linguistic annotation) and on its use, both in dictionary projects and in linguistic research. In spite of limitations with respect to corpus design, the INL corpora accessible via the Internet have proved to meet external needs. By providing these facilities, the INL has acquired a much broader experience in corpus-building than before, which is essential for new, internal dictionary projects. Giving external access to corpus data that was developed primarily for internal purposes may be profitable for all parties involved.

Highlights

  • The use of text corpora has increased considerably in the past few years, in the field of lexicography and in computational linguistics and language technology

  • Outside the field of lexicography, large corpora have become important for computational linguistics (Church and Mercer 1993) and language technology

  • From the perspective of a European infrastructure for language technology, the European Commission considered the corpus data and expertise developed by lexicographical institutions important enough to support projects in which the institutions contribute to the realization of the intended European infrastructure

Summary

Introduction

Large electronic text corpora of national languages were developed mainly for lexicographical purposes (Zampolli and Cappelli 1983). Corpus size (very large corpora) rather than corpus design is considered essential by many computational linguists using statistical methods of language analysis. Corpus practice demonstrates that lexicographical corpora for standard-language dictionaries may have very different corpus designs (Kruyt and Putter 1992, Kruyt and Van Sterkenburg 1996). [...] reusable, multifunctional and harmonized reference corpora for the European languages (Zampolli 1996). At the end of August 1996, a 38 Million Words Corpus with a diversified composition was made available in a similar way. This corpus is different from the former ones in various aspects: (a) size, (b) a broader coverage with respect to topic (subject domain), text types [...]. The use of the other INL corpora accessible via the Internet will be discussed.

Composition
Text classification
Access to the corpus data
Use of the 38 Million Words Corpus
Use by lexicographers
Use by individuals
Findings
Conclusion and discussion
