Classification of Russian Texts by Genres Based on Modern Embeddings and Rhythm

Ksenia Vladimirovna Lagutina

doi:10.18255/1818-1015-2022-4-334-347

Open Access

https://doi.org/10.18255/1818-1015-2022-4-334-347

Abstract

The article investigates modern vector text models for genre classification of Russian-language texts. The models include ELMo embeddings, a pre-trained BERT language model, and a set of numerical rhythm features derived from lexico-grammatical features. The experiments were carried out on a corpus of 10,000 texts in five genres: novels, scientific articles, reviews, posts from the social network Vkontakte, and news from OpenCorpora. Visualization and statistical analysis of the rhythm features identified the genres that are most diverse in rhythm (novels and reviews) and the least diverse (scientific articles). These same genres were subsequently classified best using rhythm features and an LSTM neural network classifier. Clustering and classifying texts by genre using ELMo and BERT embeddings separated one genre from another with a small number of errors. The multiclass F-score reached 99%. The study confirms the effectiveness of modern embeddings in computational linguistics tasks and highlights the advantages and limitations of the rhythm feature set for genre classification.
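For readers unfamiliar with the reported metric, the following is a minimal sketch of how a multiclass F-score can be computed over genre labels. The abstract does not state which averaging scheme was used; macro-averaging is an assumption here, and the function names and genre label strings are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical label strings for the five genres named in the abstract.
GENRES = ["novel", "scientific article", "review", "social post", "news"]


def per_class_f1(y_true, y_pred, labels):
    """Return a dict mapping each label to its F1 score."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (macro-averaging, an assumption)."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


if __name__ == "__main__":
    # Toy labels, not data from the study.
    truth = ["novel", "novel", "news", "news", "review"]
    preds = ["novel", "news", "news", "news", "review"]
    print(round(macro_f1(truth, preds, GENRES), 3))
```

Macro-averaging weights every genre equally regardless of how many texts it contains, which is one common choice when the classes are of similar size.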

Journal: Modeling and Analysis of Information Systems	Publication Date: Dec 18, 2022
Citations: 1	License type: cc-by
