Grammatical categories determination for Turkish and Kazakh languages based on machine learning algorithms and fulfilling dictionaries of link grammar parser

Aigerim Yerimbetova, Bakzhan Sakenov, Madina Sambetbayeva, Madina Tussupova, Mussa Turdalyuly

doi:10.15587/1729-4061.2021.238743

Abstract

This research aims to identify parts of speech for the Kazakh and Turkish languages in an information retrieval system. The proposed algorithms are based on machine learning techniques: part-of-speech identification is treated as a binary classification of words, and the most popular, well-known machine learning algorithms are studied and compared. We defined 7 dictionaries and tagged 135 million words for Kazakh, and 9 dictionaries and 50 million words for Turkish. The main problem considered in the paper is to create algorithms for filling the dictionaries of the Link Grammar Parser (LGP) system, in particular for the Kazakh and Turkish languages, using machine learning techniques. The research focuses on reviewing and comparing machine learning algorithms and methods that have achieved results on various natural language processing tasks, such as the determination of grammatical categories. To operate, the LGP system requires a dictionary that specifies a connector for each word, that is, the type of connection that can be created using this word. The authors considered methods of filling in LGP dictionaries using machine learning. The complexity of natural language processing does not exclude the possibility of identifying narrower tasks that can already be solved algorithmically, for example, determining parts of speech or splitting texts into logical groups. However, some features of natural languages significantly reduce the effectiveness of these solutions. In particular, taking into account all word forms for each word in the Kazakh and Turkish languages increases the complexity of text processing by an order of magnitude.
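The two ideas named in the abstract, binary classification of words by part of speech and filling an LGP dictionary with the results, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the perceptron stands in for whichever algorithms the authors compared, the suffix features and the eight-word Turkish training list are toy choices, and the connector expression `S-` is a hypothetical placeholder. Only the general entry shape `word1 word2: expression;` follows the Link Grammar dictionary format.

```python
def suffix_features(word, max_len=3):
    """Suffix features of length 1..max_len (assumed feature set, not the paper's)."""
    return [f"s{n}:{word[-n:]}" for n in range(1, max_len + 1) if n <= len(word)]


class BinaryPerceptron:
    """Toy binary classifier: 1 = 'is a verb', 0 = 'is not a verb'."""

    def __init__(self):
        self.weights = {}
        self.bias = 0

    def predict(self, word):
        score = self.bias + sum(self.weights.get(f, 0) for f in suffix_features(word))
        return 1 if score > 0 else 0

    def fit(self, samples, epochs=10):
        for _ in range(epochs):
            for word, label in samples:
                error = label - self.predict(word)  # -1, 0 or +1
                if error:
                    for f in suffix_features(word):
                        self.weights[f] = self.weights.get(f, 0) + error
                    self.bias += error
        return self


def lgp_entry(words, expression):
    """Format one dictionary line in the Link Grammar style: 'w1 w2: EXPR;'."""
    return f"{' '.join(words)}: {expression};"


# Toy training data (illustrative only): Turkish infinitives vs. nouns.
TRAIN = [
    ("gelmek", 1), ("kitap", 0), ("gitmek", 1), ("ev", 0),
    ("yazmak", 1), ("çocuk", 0), ("okumak", 1), ("masa", 0),
]

model = BinaryPerceptron().fit(TRAIN)

# Classify unseen words and write the predicted verbs into one dictionary entry.
unseen = ["almak", "kalem", "vermek"]
verbs = [w for w in unseen if model.predict(w) == 1]
print(lgp_entry(verbs, "S-"))  # "S-" is a hypothetical connector expression
```

One binary classifier per part of speech keeps each decision simple, but a classifier trained on so few suffixes will mislabel many real words; a realistic setup would need the tagged corpora and full word-form coverage that the abstract describes.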

Highlights

Natural language processing (NLP) is considered a major problem in many areas [1].
The research is aimed at creating a scientific and technical groundwork in the field of information and communication technologies and at obtaining new knowledge that allows for semantic analysis of texts in natural languages [2].
It is necessary to develop an improved method for determining grammatical categories in the Turkish and Kazakh languages, based on machine learning algorithms and on filling the dictionaries of the link grammar parser.

Summary

Introduction

Natural language processing (NLP) is considered a major problem in many areas [1]. The research is aimed at creating a scientific and technical groundwork in the field of information and communication technologies and at obtaining new knowledge that allows for semantic analysis of texts in natural languages [2]. Many researchers argue that a deep semantic analysis of texts is needed to create their semantic images, on the basis of which a fine ranking of documents could be carried out. This approach is undoubtedly the most reasonable, but it requires long and careful work on creating suitable tools for automatic text processing. The semantic analysis of textual information plays an important role.

Literature review and problem statement

The aim and objectives of the study

Materials and research methods

Findings

Conclusions