Abstract

The article investigates the problem of building one's own large-volume corpora of parallel texts. A technique and criteria for constructing parallel linguistic corpora are proposed. As a result of our research, we created a combined corpus of 3,850,000 sentence pairs, or 65 million words in the English part, which is about 10 % of the well-known COCA corpus or the GRAC corpus. Methods of collecting material for the corpus based on a frequency list, on terminological dictionaries, and on frequency lists of words from previously self-compiled corpora proved effective. Theoretical and practical work on normalizing the corpus was carried out. The type/token ratio (TTR), the automated readability index (ARI), the average sentence length (ASL), and related measures proved effective for studying the corpus. Graphs of the distribution of vocabulary by frequency and of sentence length in the corpus clearly illustrate the results of our research and represent the material effectively. We can also report successful experience in creating narrowly specialized terminological corpora, as opposed to terminological dictionaries, for further study of the functional features and sentence models of a particular terminological system. Medical and biological corpora (about 500 thousand sentence pairs each) and a polytechnic corpus of 1.3 million pairs were compiled. In total, eight corpora were compiled; for five of them the total number of characters, words, and sentences was calculated and presented in a summary table, the average sentence length (ASL) and the automated readability index (ARI) were determined, and the type/token ratio (TTR) was calculated. For each corpus, frequency lists of vocabulary were produced, the total amount of unique vocabulary was calculated, and the corresponding logarithmic graphs were constructed. The proposed method of analysing the distribution of vocabulary in the frequency dictionary of a text on the basis of such graphs, by dividing them into three parts (initial, middle, and tail), appears promising to us.
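The abstract refers to several corpus measures (TTR, ASL, ARI) and to a frequency-list analysis that divides the ranked vocabulary into initial, middle, and tail parts. As a minimal illustrative sketch, not the authors' actual tooling, the following Python fragment shows how such measures and a three-way split of a ranked frequency list could be computed; the input format (one English sentence per line), the simple tokenization, and the 10 % / 50 % rank cut-offs are assumptions made purely for illustration.

```python
import re
from collections import Counter

def corpus_metrics(path):
    """Compute basic statistics for one side of a parallel corpus stored
    as plain text, one sentence per line (assumed format)."""
    n_sentences = n_words = n_chars = 0
    freq = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            tokens = re.findall(r"[A-Za-z']+", line.lower())
            if not tokens:
                continue
            n_sentences += 1
            n_words += len(tokens)
            n_chars += sum(len(t) for t in tokens)
            freq.update(tokens)

    ttr = len(freq) / n_words                     # type/token ratio
    asl = n_words / n_sentences                   # average sentence length
    # Automated Readability Index (standard formula)
    ari = 4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sentences) - 21.43
    return {"sentences": n_sentences, "words": n_words, "chars": n_chars,
            "types": len(freq), "TTR": ttr, "ASL": asl, "ARI": ari}, freq

def split_frequency_list(freq, head=0.10, tail=0.50):
    """Split a ranked frequency list into initial, middle, and tail parts.
    The rank cut-offs are illustrative, not the thresholds used in the article."""
    ranked = freq.most_common()
    h, t = int(len(ranked) * head), int(len(ranked) * tail)
    return ranked[:h], ranked[h:t], ranked[t:]
```

For example, `metrics, freq = corpus_metrics("corpus.en.txt")` (a hypothetical file name) yields the summary statistics, and plotting rank against frequency from `freq.most_common()` on logarithmic axes gives the kind of frequency-distribution graph described in the abstract.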
