PARALLEL CORPUS OF THE KAZAKH AND RUSSIAN LANGUAGES: DEVELOPMENT, OPERATION AND PROBLEMS

N M Ashimbaeva, S K Kulmanov, M Nurlan, G M Ayazbaev, A Z Bisengali

doi:10.55491/2411-6076-2023-2-49-61

Abstract

The paper gives a brief overview of the history of linguistic corpora and describes their classification by various criteria, as well as the types of parallel subcorpora. The original Kazakh text of M. Auezov's epic novel «Abai Zholy» and its Russian translation by A. Kim were manually aligned at the paragraph (sentence) level in a parallel subcorpus being developed as part of the national corpus of the Kazakh language. Microsoft Office Excel, Notepad++, Python, Django, and MySQL were used to develop the parallel subcorpus. Its software architecture and order of operation can be summarized as follows: 1) texts in the two languages were collected in Excel and aligned manually at the paragraph (sentence) level; 2) the aligned texts were loaded directly from the Excel file into the MySQL database management system; 3) the loaded texts were sorted with the Notepad++ text editor and their statistics were obtained; 4) the Django web framework was used to publish the sorted texts on the Internet and to serve user requests; 5) the Processing.py program, written in Python and equipped with a search function, was used to connect the Django application to the MySQL database; 6) the software architecture of the parallel subcorpus follows the client-server model and the MVC (Model-View-Controller) pattern. The parallel subcorpus consists of a database of aligned texts, markup, metamarkup, and a search engine. The metamarkup, that is, the information about each text entered into the subcorpus, includes the following parameters: author, translator, title of the work, title of the translation, publication date of the work, translation period, original language, and translation language. The search engine allows users to search by the following parameters: word, phrase, sentence, and capital letters (in Kazakh and Russian).
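The relationship between aligned pairs, metamarkup, and search described above can be illustrated with a minimal Python sketch. It is not the authors' Processing.py. The names `AlignedPair` and `search` are hypothetical, the pairs are held in memory rather than in MySQL, only three of the metamarkup parameters are modeled, and the two sample sentences are invented for illustration and are not taken from the novel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One manually aligned paragraph (sentence) pair plus part of its metamarkup."""
    kazakh: str
    russian: str
    author: str
    translator: str
    work_title: str


def search(pairs, query, lang="kazakh", match_case=False):
    """Return the aligned pairs whose text in `lang` contains `query`.

    `query` may be a word, a phrase, or a whole sentence; `match_case`
    stands in for the capital-letter search parameter (an assumption).
    """
    if lang not in ("kazakh", "russian"):
        raise ValueError(f"unknown language: {lang!r}")
    needle = query if match_case else query.casefold()
    hits = []
    for pair in pairs:
        text = getattr(pair, lang)
        haystack = text if match_case else text.casefold()
        if needle in haystack:
            hits.append(pair)
    return hits


# Invented sample sentences (NOT from the corpus), with metamarkup from the abstract.
META = dict(author="M. Auezov", translator="A. Kim", work_title="Abai Zholy")
SAMPLE = [
    AlignedPair(kazakh="Бұл кітап.", russian="Это книга.", **META),
    AlignedPair(kazakh="Мен кітап оқимын.", russian="Я читаю книгу.", **META),
]

for hit in search(SAMPLE, "книгу", lang="russian"):
    print(hit.kazakh, "|", hit.russian)
```

Because every hit is a whole pair, a match in either language returns the aligned text in the other language, which is the basic behavior expected of a parallel subcorpus search.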
The paper describes the Kazakh and Russian interface of the parallel subcorpus and the results page shown after searching for a word by one of the search parameters. The total and unique word counts of the text in both languages, the number of sentences, and the counts and percentages of the ten most frequently used words in each language were determined. In addition, aligning the original Kazakh text of the epic novel with its Russian translation at the paragraph (sentence) level revealed the following features: 1) structurally, the aligned paragraphs (sentences) contain approximately the same number of words; 2) in content, they approximately coincide; 3) some do not coincide in structure or content: certain paragraphs (sentences) of the Kazakh original are translated into Russian incorrectly, superficially, or in abridged form, so that only their approximate meaning is conveyed.
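The word statistics mentioned above (total words, unique words, and the ten most frequent words with their percentages) can be computed along the following lines. This is a sketch under stated assumptions, not the procedure used in the paper: the function name `word_stats` is hypothetical, words are taken to be runs of letters (optionally hyphenated), and counting is case-insensitive.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of Unicode letters, optionally joined by hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def word_stats(text, top_n=10):
    """Return total and unique word counts plus the `top_n` most frequent words.

    Each entry of "top" is (word, count, percentage of all words).
    """
    words = [w.casefold() for w in WORD_RE.findall(text)]
    total = len(words)
    counts = Counter(words)
    top = [(w, n, 100.0 * n / total) for w, n in counts.most_common(top_n)]
    return {"total": total, "unique": len(counts), "top": top}


# Invented sample text (NOT from the corpus).
stats = word_stats("Это книга. Это новая книга, это так.")
print(stats["total"], stats["unique"])
for word, count, percent in stats["top"]:
    print(f"{word}\t{count}\t{percent:.1f}%")
```

Running the same function over the Kazakh and the Russian side of the aligned text would give comparable figures for both languages; a different tokenization rule would change the counts.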

Full Text