Authorship Identification of a Russian-Language Text Using Support Vector Machine and Deep Neural Networks

Aleksandr Romanov, Anna Kurtukova, Valery Goncharov, Alexander Shelupanov, Anastasia Fedotova

doi:10.3390/fi13010003

Abstract

The article explores approaches to determining the author of a natural language text, along with the advantages and disadvantages of these approaches. The problem is important because of the active digitalization of society and the shift of most everyday activities online. Authorship identification methods are particularly useful for information security and forensics. For example, such methods can be used to identify the authors of suicide notes and other texts subjected to forensic examination. Another area of application is plagiarism detection, which is relevant both to intellectual property protection in the digital space and to the educational process. The article describes identifying the author of a Russian-language text using a support vector machine (SVM) and deep neural network architectures (long short-term memory (LSTM), convolutional neural networks (CNN) with attention, and Transformer). The results show that all the considered algorithms are suitable for solving the authorship identification problem, but SVM shows the best accuracy, reaching 96% on average. This is due to carefully chosen parameters and a feature space that includes statistical and semantic features (including those extracted by aspect analysis). Deep neural networks are inferior to SVM in accuracy and reach only 93%. The study also evaluates how attacks on the method affect the models’ accuracy. Experiments show that the SVM-based methods are not robust to deliberate text anonymization. In comparison, the loss in accuracy of deep neural networks does not exceed 20%. The Transformer architecture is the most effective for anonymized texts and achieves 81% accuracy.

Highlights

Published: 25 December 2020
It is known that it is possible to determine the individual characteristics of the author on the basis of the writing style, since each text has a specific linguistic personality [1]. The topic of attribution overlaps with information security [2,3,4,5]
During the course of the research, the authors analyzed modern approaches to determining the author of a natural-language text, implemented approaches of authorship attribution based on support vector machine (SVM) and deep neural networks (NN) architectures, evaluated the developed approaches on different numbers of authors and volumes of texts, and evaluated the resistance of the approaches to anonymization techniques
Despite the great popularity of deep NN architectures, they are inferior to the traditional SVM machine learning algorithm in accuracy by more than 10% on average

Summary

Introduction

The individual characteristics of an author can be determined from writing style, since each text has a specific linguistic personality [1]. The topic of attribution overlaps with information security [2,3,4,5]. Quite often, a victim’s social media account is hacked and messages are sent on the victim’s behalf. One solution to this kind of problem is to compare the writing style of the suspicious texts with that of texts known to have been written by the person. Establishing general differences between documents based on writing style is most relevant when no other data are available that would allow the author to be identified.
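The comparison of a suspicious text with texts of known authorship can be illustrated with a minimal Python sketch. It is not the method used in the article, which relies on SVM and deep neural networks over a richer feature space. The character trigram features, the cosine similarity measure, and the function names `char_ngram_profile`, `cosine_similarity`, and `style_similarity` are all assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt
from typing import Iterable


def char_ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def style_similarity(suspicious: str, known_texts: Iterable[str], n: int = 3) -> float:
    """Compare a suspicious text with the pooled n-gram profile of known texts.

    Illustrative assumption: a higher score suggests a closer writing style.
    It is not evidence of authorship on its own.
    """
    known_profile = Counter()
    for text in known_texts:
        known_profile.update(char_ngram_profile(text, n))
    return cosine_similarity(char_ngram_profile(suspicious, n), known_profile)
```

In practice, a score like this would be only one input to a decision. Short messages yield sparse profiles, so any threshold would need to be calibrated on texts of known authorship.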



Journal: Future Internet
Publication Date: Dec 25, 2020
Citations: 15
License type: CC BY 4.0


