A gender identification of text author in mixture of Russian multi-genre texts with distortions on base of data-driven approach using machine learning models

Aleksandr Sboev, Dmitry Gudovskikh, Roman Rybka, Ivan Moloshnikov

doi:10.1063/1.5114280

Abstract

In this work we investigate a wide set of data-driven machine learning models (Long Short-Term Memory networks, convolutional neural networks, multilayer perceptrons, Random Forest classifiers, logistic regression, and Gradient Boosting classifiers, each with different sets of features) for identifying the gender of the author of Russian multi-genre texts when style distortions and gender deception are present in the training and testing sets. We evaluate accuracy in two situations: when style distortions and gender deception are present in training texts of different genres, and when such deception is present only in the test texts. A comparison with known results from the literature is presented. The data corpora include one collected via a crowdsourcing platform, essays of Russian students (RusPersonality), the Gender Imitation corpus, and the corpora used at the Forum for Information Retrieval Evaluation 2017 (FIRE), containing texts from Facebook, Twitter and Reviews. We analyze numerical experiments that combine different features of the input texts (morphological data, vectors of character n-gram frequencies, LIWC and others) with various machine learning models. The results, obtained on a wide set of data-driven models, establish the accuracy level for identifying the gender of the author of a Russian text in the multi-genre case and show the effect of deception in the test and training sets.
