Abstract

This study analyses an Italian literary corpus from a diachronic perspective using machine learning methods. Working with a body of texts written between the 16th and the 21st century, we apply a well-known and robust machine learning (ML) algorithm, Random Forest (RF), to see how the texts are classified under four different partitions, each representing a periodization proposed by one of four scholars of Italian literature. The corpus used to train the ML algorithm comprises 420 Italian texts: 100 from the 16th century, 27 from the 17th, 57 from the 18th, 100 from the 19th, 100 from the 20th, and 36 from the 21st. To vectorize the texts, we used the Author's Multilevel N-gram Profile (AMNP) (Mikros and Perifanos, 2013; Cortelazzo, Mikros, and Tuzzi, 2018), a document representation method that takes into account a diverse set of linguistic features, namely n-grams of increasing length (unigrams, bigrams, trigrams) at two levels (character and word). Each text was split into chunks of 2000 words, and each chunk was then transformed into an AMNP vector. The Random Forest algorithm classified the chunks with high accuracy: precision across the four periodizations ranged from a minimum of 89% in the partition based on Migliorini's theories to a maximum of 97% in the partition based on Cella's. Looking at the misclassified cases, particularly in the training based on Migliorini's partition, it is interesting to note that when Random Forest assigns a text chunk to the wrong century, the error is usually of +/- 1 century.
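To make the preprocessing described above more concrete, the following is a minimal sketch of splitting a text into 2000-word chunks and counting character and word n-grams of length 1 to 3. It is not the authors' implementation: the function names (`split_into_chunks`, `multilevel_profile`, `to_vector`), whitespace tokenization, the decision to drop a short trailing remainder, and the use of raw counts are all assumptions made for illustration. The published AMNP method may select and weight features differently.

```python
from collections import Counter


def split_into_chunks(text, chunk_size=2000):
    """Split a text into consecutive chunks of `chunk_size` words.

    Assumption: words are whitespace-separated and a trailing remainder
    shorter than `chunk_size` is dropped.
    """
    words = text.split()
    return [
        words[i:i + chunk_size]
        for i in range(0, len(words) - chunk_size + 1, chunk_size)
    ]


def ngrams(seq, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def multilevel_profile(chunk_words, max_n=3):
    """Count word-level and character-level n-grams for n = 1..max_n.

    Keys have the form (level, n, ngram), so features from the two levels
    and the different lengths live side by side in one profile.
    """
    profile = Counter()
    chars = list(" ".join(chunk_words))
    for n in range(1, max_n + 1):
        for gram in ngrams(chunk_words, n):
            profile[("word", n, gram)] += 1
        for gram in ngrams(chars, n):
            profile[("char", n, gram)] += 1
    return profile


def to_vector(profile, vocabulary):
    """Project a profile onto a fixed, ordered feature vocabulary."""
    return [profile.get(feature, 0) for feature in vocabulary]
```

In a setup like the one described, the vocabulary would typically be fixed on the training chunks (for example, the most frequent n-grams per level and length), and the resulting vectors, labelled with the period of the source text, would then be passed to a Random Forest classifier.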
