Abstract

This article analyzes how combinations of stylometric characteristics of different levels affect the quality of authorship verification for Russian, English and French prose texts. The study covered both low-level stylometric characteristics based on words and symbols and higher-level structural characteristics. All stylometric characteristics were calculated automatically with the ProseRhythmDetector program. This approach made it possible to analyze large volumes of text by many writers at the same time. In the course of the work, each text was mapped to vectors of stylometric characteristics at the symbol, word and structure levels. In the experiments, the parameter sets of these three levels were combined with each other in all possible ways. The resulting vectors were fed to various classifiers to perform verification and to identify the classifier best suited to the problem. The best results were obtained with the AdaBoost classifier. The average F-score across all languages exceeded 92 %. Detailed assessments of verification quality are given and analyzed for each author. The use of high-level stylometric characteristics, in particular the frequency of N-grams of POS tags, offers the prospect of a more detailed analysis of an individual author's style. The experimental results show that the most accurate authorship verification for literary texts in Russian, English and French is obtained when structure-level characteristics are combined with word-level and/or symbol-level characteristics. In addition, the authors conclude that the impact of stylometric characteristics on the quality of authorship verification differs between languages.
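The combination step described in the abstract can be illustrated with a minimal sketch. It assumes that "all possible ways" means every non-empty subset of the three levels (seven in total) and that combining means concatenating the per-level vectors; the level names, function names and numbers below are hypothetical and are not taken from ProseRhythmDetector.

```python
from itertools import combinations

# Names of the three feature levels mentioned in the abstract (labels are assumptions).
LEVELS = ("symbols", "words", "structure")


def level_combinations(levels=LEVELS):
    """Return every non-empty subset of the feature levels, smallest first."""
    return [
        combo
        for size in range(1, len(levels) + 1)
        for combo in combinations(levels, size)
    ]


def combine_features(features_by_level, combo):
    """Concatenate the vectors of the selected levels into one feature vector."""
    combined = []
    for level in combo:
        combined.extend(features_by_level[level])
    return combined


# Made-up feature values for a single text, for illustration only.
example_text_features = {
    "symbols": [0.12, 0.08],
    "words": [0.30],
    "structure": [0.05, 0.02, 0.01],
}

all_combos = level_combinations()
vectors = {
    combo: combine_features(example_text_features, combo) for combo in all_combos
}
```

Each entry of `vectors` would then be passed to a classifier, so that the contribution of every level combination can be compared on the same texts.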

Highlights

  • The average F-score across all languages exceeded 92 %

  • The use of high-level stylometric characteristics, in particular the frequency of N-grams of POS tags, offers the prospect of a more detailed analysis of an individual author's style. The experimental results show that the most accurate authorship verification for literary texts in Russian, English and French is obtained when structure-level characteristics are combined with word-level and/or symbol-level characteristics

  • The authors conclude that the impact of stylometric characteristics on the quality of authorship verification differs between languages


Summary

Current state of research in text authorship verification

The authorship verification task can be treated as a mathematical binary classification problem: does the document under consideration belong to a given class or not? The authors of [10] point out that the reliability of such parameters in machine learning algorithms drops considerably for short and thematically diverse social media texts. Other researchers used simple text modeling with embeddings based on Word2vec word vector representations [11] to verify the authorship of short articles in English. The fact that sentence-based features did not affect classification quality can be explained by the nature of short social media messages, which rarely consist of many sentences. Identifying several groups of text parameters at different levels (symbols, words, sentence structure) and studying their influence on the quality of authorship verification is a relevant problem in automatic natural language processing.
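The binary-classification framing above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the helper names are hypothetical, the predictions are made up, and the F-score is assumed to be the usual F1 measure for the positive class ("written by the candidate author").

```python
def verification_labels(text_authors, candidate):
    """Label each text 1 if it was written by the candidate author, otherwise 0."""
    return [1 if author == candidate else 0 for author in text_authors]


def f_score(y_true, y_pred):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: four texts, verifying authorship for one candidate author.
authors = ["Trollope", "Kingsley", "Trollope", "Kingsley"]
y_true = verification_labels(authors, "Trollope")

# Made-up classifier output with one false positive.
y_pred = [1, 1, 1, 0]
score = f_score(y_true, y_pred)
```

In this toy case precision is 2/3 and recall is 1, so the score is 0.8. A per-author score of this kind is what would be averaged across authors and languages.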

Overview of characteristics
Text classification
Random Forest classifier
SVM classifier
Gaussian Naive Bayes classifier
Text corpus
Experimental setup
Experimental results
A Trollope
C Kingsley
Findings
Conclusion
