Identification of authorship of Ukrainian-language texts of journalistic style using neural networks

Maksym Lupei, Alexander Mitsa, Vasyl Sharkan, Volodymyr Repariuk

doi:10.15587/1729-4061.2020.195041

Open Access

https://doi.org/10.15587/1729-4061.2020.195041

Abstract

This paper explores the development of an effective method for text authorship identification, using publications by well-known Ukrainian journalists as material. Most existing methods require text preprocessing, which adds cost to solving the problem. When the number of candidate authors can be kept small, such preprocessing is often excessive. Another disadvantage of existing approaches is that the vast majority were applied to texts in other languages and did not take the peculiarities of the Ukrainian language into account. Therefore, the aim was to develop an approach that identifies the author of a Ukrainian-language text without preprocessing and with high accuracy, and to establish which types of artificial neural networks give the minimum error for Ukrainian publicists.

The developed method uses a feedforward multilayer perceptron, a supervised learning algorithm, HashingVectorizer vectorization, and the Adam optimizer. It was determined that a small number of training iterations (4–5) is enough to obtain fairly high accuracy in identifying the authorship of journalistic texts and a fairly small error. Over 1,000 text fragments by three Ukrainian authors were used. The experiments showed that the developed approach achieves fairly high results: for texts of at least 500 characters, accuracy reaches 91%, and the maximum number of training iterations does not exceed 15. These results were achieved primarily through the efficient choice of the vectorization method at the preparatory stage and of the structure of the artificial neural network.
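The pipeline named in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes scikit-learn, which provides a class called HashingVectorizer and an MLPClassifier that supports the Adam solver. The character n-gram settings, the number of hash features, the hidden-layer size, the author labels, and the sample sentences are all hypothetical choices made for this example; the tiny invented corpus merely stands in for the 1,000+ fragments used in the study.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Invented placeholder fragments (NOT data from the study), two per author.
texts = [
    "Сьогодні в місті відбулася важлива подія для всієї громади.",
    "Громада міста знову зібралася, щоб обговорити важливі події.",
    "Економіка країни потребує реформ, і це очевидно кожному.",
    "Реформи в економіці очевидно потрібні, і країна це розуміє.",
    "Культура і мова формують нашу спільну пам'ять та ідентичність.",
    "Наша ідентичність тримається на мові, культурі та пам'яті.",
]
labels = ["author_a", "author_a", "author_b", "author_b", "author_c", "author_c"]

# Raw text goes straight into the vectorizer: no stemming, stop-word
# removal, or other preprocessing step is applied beforehand.
# Assumption: character n-grams and 4096 hash buckets; the abstract does
# not state the vectorizer settings.
vectorizer = HashingVectorizer(analyzer="char_wb", ngram_range=(1, 3), n_features=2**12)

# Feedforward multilayer perceptron trained with Adam. max_iter=15 mirrors
# the iteration cap mentioned in the abstract; the single hidden layer of
# 64 units is an assumption.
classifier = MLPClassifier(
    hidden_layer_sizes=(64,), solver="adam", max_iter=15, random_state=0
)

model = make_pipeline(vectorizer, classifier)

# With so few iterations scikit-learn warns that training has not
# converged; that is expected here, so the warning is silenced.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    model.fit(texts, labels)

# Predict the author label for a new, equally invented fragment.
predicted = model.predict(["Важлива подія зібрала громаду міста."])
print(predicted[0])
```

Because the corpus here is a toy, the printed label carries no meaning and no accuracy figure should be read into it. The sketch only shows how hashing-based vectorization lets raw text reach the network without a fitted vocabulary or a separate preprocessing pass.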

Highlights

With the advancement of technology, artificial neural networks are increasingly used for tasks that take a person far longer to solve than a computer.
In article [9], two approaches were proposed, based on multiple-discriminant analysis (MDA) and a support vector machine (SVM).
The developed approach ensures high accuracy, comparable to that of the most effective methods.

Summary

Introduction

With the advancement of technology, artificial neural networks are increasingly used for tasks that take a person far longer to solve than a computer. Relevant tasks of this kind include identifying a primary source, identifying the authorship of anonymous texts, combating plagiarism, and attributing a text to a particular author during legal expert examination. There are many approaches to these tasks, based on different methods and yielding different levels of accuracy. Developing a universal method that gives the highest authorship-identification accuracy while consuming fewer resources remains an unresolved problem. By taking the specifics of a particular language into account, one can develop an effective approach to authorship identification.

Literature review and problem statement

The aim and objectives of the study

Results of experiments

Conclusions