Abstract

The paper discusses several key concepts related to the development of corpora and reconsiders them in light of recent developments in NLP. On the basis of an overview of present-day corpora, we conclude that the dominant practices of corpus design do not make adequate use of the available technologies and, as a result, fail to meet the demands of corpus linguistics, computational lexicology and computational linguistics alike.

We proceed to lay out a data-driven approach to corpus design, which integrates the best practices of traditional corpus linguistics with the potential of the latest technologies for fast collection, automatic metadata description and annotation of large amounts of data. The gist of the approach we propose is that corpus design should centre on amassing large amounts of mono- and multilingual texts and on providing them with a detailed metadata description and high-quality multi-level annotation.

We go on to illustrate this concept with a description of the compilation, structuring, documentation and annotation of the Bulgarian National Corpus (BulNC). At present it consists of a Bulgarian part of 979.6 million words, which constitutes the corpus kernel, and 33 Bulgarian-X language corpora totalling 972.3 million words, for 1.95 billion words altogether. The BulNC is supplied with a comprehensive metadata description, which allows us to organise the texts according to different principles. The Bulgarian part of the BulNC is automatically processed (tokenised and sentence-split) and annotated at several levels: morphosyntactic tagging, lemmatisation, word-sense annotation, and annotation of noun phrases and named entities. Some levels of annotation are also applied to the Bulgarian-English parallel corpus, with the prospect of expanding multilingual annotation both in terms of linguistic levels and in the number of languages for which it is available. We conclude with a brief evaluation of the quality of the corpus and an outline of its applications in NLP and linguistic research.
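
To make the idea of organising texts by their metadata concrete, the following is a minimal sketch and not the BulNC's actual schema or tooling. The `TextRecord` fields, the helper names and the toy records are assumptions chosen for illustration; only the two size figures (in millions of words) come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical metadata record; the field names are illustrative only."""
    text_id: str
    language: str
    domain: str
    words: int


def group_by(records, field):
    """Organise records into subcorpora keyed by one metadata field."""
    groups = defaultdict(list)
    for record in records:
        groups[getattr(record, field)].append(record)
    return dict(groups)


def total_words(records):
    """Sum the word counts of a collection of records."""
    return sum(record.words for record in records)


# Toy records (invented for the example, not taken from the corpus).
records = [
    TextRecord("t1", "bg", "fiction", 1200),
    TextRecord("t2", "bg", "news", 800),
    TextRecord("t3", "en", "news", 500),
]

by_language = group_by(records, "language")
by_domain = group_by(records, "domain")

# Sizes reported in the abstract, in millions of words.
BULGARIAN_PART_M = 979.6
PARALLEL_PART_M = 972.3
total_m = BULGARIAN_PART_M + PARALLEL_PART_M  # about 1951.9 million, i.e. 1.95 billion
```

The same records can be regrouped by any field without touching the texts themselves, which is the practical benefit of keeping the metadata separate from the content.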

Highlights

  • Since the first structured electronic corpus, the Brown Corpus (Francis and Kučera, 1964), corpora have been increasingly used as a source of authentic linguistic data for theoretical and applied research

  • A corpus is typically viewed as a collection of authentic linguistic data that may be used in linguistic research (Garside et al., 1997)

  • With the increased development of language technologies, the applications of corpora have been extended to all areas of computational linguistics and natural language processing (NLP)

Summary

Large monolingual corpora

A number of large corpora have recently come into existence, with sizes ranging from several (Baroni and Kilgarriff, 2006; Pomikálek et al., 2009), through dozens (Pomikálek et al., 2012), to hundreds of billions of words (Google Books Corpora, GBC, the largest being the 200-billion-word GBC of American English). What distinguishes these from the rest of the discussed corpora is that they represent a different approach to corpus creation, since they are collected fully automatically from web content.

Searches may target exact words or phrases, regular expressions, POS, lemma, collocations, and the frequency and distribution of synonyms, with further refinement in terms of genre or time period.

This brief outline shows that the dominant and constant tendency is for corpora to aim at a size ranging from several hundred million up to over a billion words.

Compilation of very large unbalanced corpora from the web whose structure and content are not concerned with balance and representativeness.
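
The query types named above (exact word, regular expression, POS, lemma) can be sketched over a list of annotated tokens. This is an illustrative toy, not the search interface of any of the corpora mentioned; the `Token` structure, the `search` function and the sample tags are assumptions.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical annotated token: surface form, lemma and POS tag."""
    word: str
    lemma: str
    pos: str


def search(tokens, word=None, pattern=None, lemma=None, pos=None):
    """Return the indices of tokens that satisfy every criterion given.

    `word` is an exact match on the surface form, `pattern` is a regular
    expression that must match the whole surface form, and `lemma` and
    `pos` are exact matches on the respective annotations.
    """
    compiled = re.compile(pattern) if pattern is not None else None
    hits = []
    for index, token in enumerate(tokens):
        if word is not None and token.word != word:
            continue
        if compiled is not None and not compiled.fullmatch(token.word):
            continue
        if lemma is not None and token.lemma != lemma:
            continue
        if pos is not None and token.pos != pos:
            continue
        hits.append(index)
    return hits


# Toy sentence with invented annotations.
tokens = [
    Token("Corpora", "corpus", "NOUN"),
    Token("are", "be", "VERB"),
    Token("growing", "grow", "VERB"),
    Token("corpus", "corpus", "NOUN"),
]

lemma_hits = search(tokens, lemma="corpus")
regex_hits = search(tokens, pattern=r"[a-z]+ing")
verb_hits = search(tokens, pos="VERB")
```

Combining criteria narrows the result: `search(tokens, lemma="corpus", word="corpus")` keeps only the lower-case singular form. Refinement by genre or time period would additionally require text-level metadata, which this sketch leaves out.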

Large parallel corpora
An overview of Bulgarian corpora
Corpus size revisited
Balance and representativeness reconsidered
Extended metadata and linguistic annotation
Compilation of the BulNC
Size of the Bulgarian National Corpus
Structure of the Bulgarian National Corpus
Features of the text
Documentation and annotation
Text metadata
Monolingual annotation
Multilingual annotation
Annotation formats
General evaluation of the BulNC
Public access to the BulNC
Findings
Specialised subcorpora