Size of corpora and collocations: The case of Russian

Maria Khokhlova, Vladimir Benko

doi:10.4312/slo2.0.2020.2.58-77

Abstract

With the arrival of information technologies in linguistics, compiling a large corpus, particularly of web texts, has become a largely technical matter. These new opportunities have revived the question of corpus size: are larger corpora better for linguistic research or, more precisely, do lexicographers need to analyze larger numbers of collocations? The paper reports experiments on collocation identification in low-frequency lexis using corpora of different sizes (1 million, 10 million, 100 million and 1.2 billion words). We selected low-frequency adjectives, nouns and verbs from the Russian Frequency Dictionary and tested the following hypotheses: 1) collocations in low-frequency lexis are better represented in larger corpora; 2) frequent collocations listed in dictionaries have few occurrences in small corpora; 3) statistical measures for collocation extraction behave differently in corpora of different sizes. The results indicate that corpora of under 100 million words are not representative enough for studying collocations, especially those with nouns and verbs. MI and Dice tend to extract less reliable collocations as corpus size increases, whereas t-score and Fisher’s exact test perform better on larger corpora.
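
The four association measures named in the abstract can be illustrated with a minimal sketch. It uses common textbook definitions based on a 2x2 contingency table; the abstract does not give the formulas, so the exact variants used in the paper may differ. All function names and the toy counts below are hypothetical and chosen only for illustration.

```python
import math


def expected(f_x, f_y, n):
    """Expected co-occurrence frequency if the two words were independent."""
    return f_x * f_y / n


def mi(o11, f_x, f_y, n):
    """Mutual information (pointwise): log2 of observed over expected frequency."""
    return math.log2(o11 / expected(f_x, f_y, n))


def t_score(o11, f_x, f_y, n):
    """t-score: (observed - expected) / sqrt(observed)."""
    return (o11 - expected(f_x, f_y, n)) / math.sqrt(o11)


def dice(o11, f_x, f_y):
    """Dice coefficient: 2 * f(xy) / (f(x) + f(y)). Independent of corpus size."""
    return 2 * o11 / (f_x + f_y)


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), via lgamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_p(o11, f_x, f_y, n):
    """One-sided Fisher's exact test: probability of seeing o11 or more
    co-occurrences under the hypergeometric null model.
    The loop runs up to min(f_x, f_y), so it is slow for very frequent words."""
    log_denominator = _log_comb(n, f_y)
    total = 0.0
    for k in range(o11, min(f_x, f_y) + 1):
        total += math.exp(
            _log_comb(f_x, k) + _log_comb(n - f_x, f_y - k) - log_denominator
        )
    return min(1.0, total)


# Toy counts (invented, not from the paper): the pair occurs 10 times,
# each word occurs 100 times, in a corpus of 10,000 tokens.
o11, f_x, f_y, n = 10, 100, 100, 10_000
print("MI:      ", mi(o11, f_x, f_y, n))
print("t-score: ", t_score(o11, f_x, f_y, n))
print("Dice:    ", dice(o11, f_x, f_y))
print("Fisher p:", fisher_p(o11, f_x, f_y, n))
```

In these formulations MI depends on the ratio of observed to expected frequency, while t-score and Fisher's exact test also respond to the absolute number of observations. This is one plausible reason the measures could behave differently as corpus size changes, though the sketch itself does not reproduce the paper's results.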

Journal: Slovenščina 2.0: empirične, aplikativne in interdisciplinarne raziskave
Publication Date: Aug 10, 2020
Citations: 1
License type: CC BY-SA 4.0
