Abstract

The article discusses how collocations are represented in Russian dictionaries and how information about them can be covered in a collocation database that is currently being developed. Such a resource (a gold standard) can be in demand for developing applications for teaching or learning Russian as a foreign language and for addressing other theoretical and applied problems. The aim of the study was twofold: first, to analyze how explanatory and specialized dictionaries of the Russian language represent collocations and to what extent their data coincide; second, to investigate how these dictionary collocations are reflected in text corpora. This makes it possible to trace the relation between manually collected data and modern corpora. For the study, the author used the disambiguated subcorpus and the main corpus of the Russian National Corpus (RNC), with 6 million and 321 million words respectively, as well as the large Internet corpus ruTenTen, with more than 14.5 billion words. The author considered attributive phrases built on the “adjective/participle + noun” model. She analyzed 120 collocations with different values of the dictionary index, i.e. the number of dictionaries in which a given phrase is recorded. The following hypothesis was tested: a high collocation frequency corresponds to the item being recorded in several dictionaries. Nonparametric analogues of analysis of variance (the Friedman and Kruskal-Wallis tests) were used to assess the statistical significance of differences in the quantitative data. The frequencies of collocations were compared across corpora of different sizes and across dictionaries. In total, more than 15 thousand examples were processed; less than 0.5% of them were present in four of the six reviewed dictionaries (five printed and one electronic).
The results show that the data are heterogeneous: the selection of items for a dictionary does not agree with their frequency characteristics, and dictionary word combinations may thus turn out to be low-frequency. About 34% of the examples are absent from the disambiguated RNC subcorpus, and about 12% of the analyzed collocations are rare (less than 0.01 ipm, i.e. instances per million words) even in the ruTenTen corpus. The presence of a collocation in several dictionaries indicates a higher frequency and hence reproducibility in speech. Explanatory dictionaries and collocation dictionaries show the smallest overlap of data. The results show that the amount of data is crucial and that the phenomenon of collocability should be studied on large corpora.
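
To make the 0.01 ipm rarity threshold concrete, the following is a minimal sketch that converts between raw counts and ipm (instances per million words) for the three corpora named in the abstract. The function names `ipm` and `count_at_ipm` are illustrative assumptions, not part of the study, and the ruTenTen size is taken as exactly 14.5 billion words although the abstract says "more than".

```python
# Corpus sizes in words, as stated in the abstract.
# Assumption: ruTenTen is rounded down to 14.5 billion ("more than 14.5 billion").
CORPUS_SIZES = {
    "RNC disambiguated": 6_000_000,
    "RNC main": 321_000_000,
    "ruTenTen": 14_500_000_000,
}

# Rarity threshold used in the abstract.
RARE_THRESHOLD_IPM = 0.01


def ipm(count, corpus_size):
    """Relative frequency in instances per million words."""
    return count / corpus_size * 1_000_000


def count_at_ipm(ipm_value, corpus_size):
    """Raw number of occurrences that corresponds to a given ipm value."""
    return ipm_value * corpus_size / 1_000_000


# Raw count equivalent to the rarity threshold in each corpus.
thresholds = {
    name: count_at_ipm(RARE_THRESHOLD_IPM, size)
    for name, size in CORPUS_SIZES.items()
}

for name, raw_count in thresholds.items():
    print(f"{name}: {raw_count:g} occurrences at {RARE_THRESHOLD_IPM} ipm")
```

Under these assumptions, 0.01 ipm corresponds to roughly 145 occurrences in ruTenTen, about 3 in the main RNC corpus, and well under one occurrence in the 6-million-word disambiguated subcorpus. A collocation at that frequency level would therefore not be expected to appear in the smallest corpus at all, which is consistent with the conclusion that collocability should be studied on large corpora.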
