Comparison of Style Features for the Authorship Verification of Literary Texts

Ksenia Vladimirovna Lagutina

doi:10.18255/1818-1015-2021-3-250-259

Open Access

Abstract

The article compares character-level, word-level, and rhythm features for the authorship verification of literary texts of the 19th-21st centuries. The text corpora contain fragments of novels; each fragment is about 50 000 characters long, and there are 40 fragments for each author. 20 authors who wrote in English, Russian, French, and 8 Spanish-language authors are considered. The authors of this paper use existing algorithms for the calculation of low-level features, popular in computational linguistics, and rhythm features, common in literary texts. Low-level features include n-grams of words, frequencies of letters and punctuation marks, average word and sentence lengths, etc. Rhythm features are based on lexico-grammatical figures: anaphora, epiphora, symploce, aposiopesis, epanalepsis, anadiplosis, diacope, epizeuxis, chiasmus, polysyndeton, and repetitive exclamatory and interrogative sentences. These features include the frequency of occurrence of particular rhythm figures per 100 sentences, the number of unique words in the aspects of rhythm, and the percentage of nouns, adjectives, adverbs, and verbs in the aspects of rhythm. Authorship verification is treated as a binary classification problem: whether the text belongs to a particular author or not. AdaBoost and a neural network with an LSTM layer are used as classification algorithms. The experiments demonstrate the effectiveness of rhythm features in the verification of particular authors, and the superiority, on average, of combinations of feature types over single feature types. The best values of precision, recall, and F-measure for the AdaBoost classifier exceed 90% when all three types of features are combined.

Highlights

Style features: We compare three types of features: character-level, word-level, and rhythm-level ones. The first two feature types are popular, effective features from the state of the art. The rhythm features describe the specific style marks of the authors that frequently appear in literary texts
The authors of this paper use existing algorithms for the calculation of low-level features, popular in computational linguistics, and rhythm features, common in literary texts
Authorship verification is considered as a binary classification problem: whether the text belongs to a particular author or not

Summary

Style features

We compare three types of features: character-level, word-level, and rhythm-level ones. The first two feature types are popular, effective features from the state of the art. The rhythm features describe the specific style marks of the authors that frequently appear in literary texts.

Character-level features:
– Average sentence length in characters, including punctuation marks and spaces.
– Frequencies of occurrence of each letter among all letters.
– Frequencies of occurrence of each punctuation mark.

Word-level features:
– Average sentence length in words.

Rhythm features:
– The density of the figure: the number of occurrences of the rhythm figure (anaphora, epiphora, etc.) divided by the number of sentences.

Character- and word-level features represent the basic statistics of the text style. Rhythm features represent the density and linguistic structure of the text rhythm. The text is modeled as a vector of statistical and linguistic features.
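The features listed above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the sentence splitter, the punctuation set, the function names, and the very simplified anaphora detector (two adjacent sentences that start with the same word) are all assumptions made for illustration. Punctuation frequencies are normalized over all punctuation marks, which is also an assumption, since the source does not state the denominator.

```python
import re
from collections import Counter

# Assumed punctuation set; the paper's exact list is not given in this text.
PUNCTUATION = set(".,;:!?-()\"'")


def split_sentences(text):
    """Naive sentence splitter (assumption): break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_words(sentence):
    """Lower-cased runs of letters; digits and punctuation are dropped."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


def relative_frequencies(items):
    """Map each distinct item to its share among all items."""
    counts = Counter(items)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def anaphora_density(sentences):
    """Simplified anaphora (assumption): adjacent sentences sharing the same
    first word. Density = number of such occurrences / number of sentences."""
    first = [(split_words(s) or [None])[0] for s in sentences]
    hits = sum(1 for a, b in zip(first, first[1:]) if a is not None and a == b)
    return hits / len(sentences) if sentences else 0.0


def feature_vector(text):
    """Model a text as a dictionary of statistical and rhythm features."""
    sentences = split_sentences(text)
    n = len(sentences)
    return {
        # Character-level: sentence length counts punctuation and spaces.
        "avg_sentence_len_chars": sum(len(s) for s in sentences) / n if n else 0.0,
        "letter_freq": relative_frequencies(c.lower() for c in text if c.isalpha()),
        "punct_freq": relative_frequencies(c for c in text if c in PUNCTUATION),
        # Word-level.
        "avg_sentence_len_words": sum(len(split_words(s)) for s in sentences) / n if n else 0.0,
        # Rhythm.
        "anaphora_density": anaphora_density(sentences),
    }


features = feature_vector("We came. We saw. We won! They left.")
```

A real pipeline would need a language-aware sentence splitter and detectors for the other figures (epiphora, symploce, and so on). The dictionary would then be flattened into a fixed-length numeric vector before classification.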

Authorship verification
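The abstract frames verification as a binary classification problem (whether a text belongs to a particular author or not) and evaluates it with precision, recall, and F-measure. The sketch below shows only that framing and the metrics, under stated assumptions. The helper names are hypothetical, the F-measure is taken to be the balanced F1 score, and the classifier itself (AdaBoost or the LSTM network in the paper) is not reproduced here.

```python
def binary_labels(authors, target_author):
    """Label a fragment 1 if written by the target author, otherwise 0."""
    return [1 if author == target_author else 0 for author in authors]


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and balanced F-measure (F1, an assumption)
    for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: four fragments, verification target is author "A".
y_true = binary_labels(["A", "B", "A", "C"], "A")
y_pred = [1, 1, 0, 0]  # made-up classifier output, for illustration only
scores = precision_recall_f1(y_true, y_pred)
```

Building one such label vector per target author turns the multi-author corpus into a set of independent binary tasks, each of which can be scored separately.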

Experiments

Findings

Conclusion