Abstract

Automatic identification of authorship in disputed documents has benefited from complex network theory, as this approach does not require human expertise or detailed semantic knowledge. Networks modeling entire books can be used to discriminate texts from different sources and to understand network growth mechanisms, but only a few studies have probed the suitability of networks for modeling small chunks of text to capture stylistic features. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with an equal number of linguistic tokens, from which time series were created for 12 topological metrics. Since 73% of all series were stationary (ARIMA(p, 0, q)) and the remainder were integrated of first order (ARIMA(p, 1, q)), probability distributions could be obtained for the global network metrics. The metrics exhibit bell-shaped non-Gaussian distributions, and therefore distribution moments were used as learning attributes. With an optimized supervised learning procedure based on a nonlinear transformation performed by Isomap, 71 out of 80 texts were correctly classified using the K-nearest neighbors algorithm, i.e., an author-matching success rate of 88.75% was achieved. Hence, purely dynamic fluctuations in network metrics can characterize authorship, paving the way for a robust description of large texts in terms of small evolving networks.
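The first steps of the pipeline described above (equal-sized sections, one metric value per section, moments of the resulting series) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, average degree stands in for the 12 topological metrics (which are not listed here), words are linked only when adjacent, and the choice of the first four moments is an assumption. The ARIMA, Isomap and K-nearest neighbors steps are omitted.

```python
from statistics import fmean


def split_into_chunks(tokens, chunk_size):
    """Split tokens into consecutive chunks of exactly chunk_size tokens.

    A trailing remainder shorter than chunk_size is dropped (an assumption).
    """
    n_chunks = len(tokens) // chunk_size
    return [tokens[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]


def average_degree(chunk):
    """Average degree of the network that links adjacent, distinct words."""
    nodes = set(chunk)
    edges = {frozenset(pair) for pair in zip(chunk, chunk[1:]) if pair[0] != pair[1]}
    return 2 * len(edges) / len(nodes)


def moments(series):
    """Return mean, population variance, skewness and excess kurtosis."""
    mu = fmean(series)
    centered = [x - mu for x in series]
    var = fmean([c ** 2 for c in centered])
    if var == 0:
        return mu, 0.0, 0.0, 0.0
    skew = fmean([c ** 3 for c in centered]) / var ** 1.5
    kurt = fmean([c ** 4 for c in centered]) / var ** 2 - 3
    return mu, var, skew, kurt


if __name__ == "__main__":
    text = "the cat saw the dog and the dog saw the cat near the old house"
    chunks = split_into_chunks(text.split(), chunk_size=5)
    series = [average_degree(chunk) for chunk in chunks]
    # The moments of each metric's series would serve as learning attributes.
    print(series, moments(series))
```

In a fuller version, each text would yield one such attribute vector per metric, and the vectors would then be passed to the dimensionality reduction and classification stages.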

Highlights

  • The overall structure and dynamics of networks representing texts have been modeled to describe their mechanism of growth and attachment [29, 30], while nuances in the topology of real networks were exploited in practical problems, including natural language processing [31,32,33,34]

  • In both cases the best results are obtained with an intermediate number of metrics: we begin by removing misleading attributes, improving classification; at the end of the process most of the attributes

Summary

Introduction

Statistical methods have long been applied to analyze complex systems [1,2,3,4,5], including written texts and language patterns [6]; such analyses include network representations of text used to investigate linguistic phenomena [7,8,9,10,11,12,13,14]. Examples of language-related networks include phonological networks with modular or cut-off power-law behavior [20,21,22,23], semantic similarity networks with small-world and scale-free properties [24], syntactic dependency networks with hierarchical and small-world organization [25, 26], and collocation networks, which also display small-world and scale-free properties [8]. Of particular relevance to this study, word co-occurrence networks are a special case of collocation networks in which two words (nodes) are linked if they appear close to each other in a text. Several patterns have been identified in co-occurrence networks formed from large corpora, such as power-law regimes in the degree distribution [7] and a core-periphery structure [28], both resulting from the complex organization of the lexicon. We use the co-occurrence representation to probe whether the variation of network topology along a text can identify an author’s style.
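To make the co-occurrence representation concrete, the sketch below builds an undirected network in which two distinct words are linked when they appear within a fixed number of positions of each other, and then tabulates the degree distribution. The names `cooccurrence_network`, `degree_distribution` and the `window` parameter are hypothetical, and the text above does not specify how "close" is defined or whether the text is preprocessed, so the default window of 1 (adjacent words only) is an assumption.

```python
from collections import Counter, defaultdict


def cooccurrence_network(tokens, window=1):
    """Build an undirected word co-occurrence network as an adjacency dict.

    Two distinct words are linked if they occur at most `window` positions
    apart. The window size is an illustrative assumption.
    """
    adjacency = defaultdict(set)
    for i, word in enumerate(tokens):
        adjacency[word]  # register the word as a node even if it stays isolated
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                adjacency[word].add(other)
                adjacency[other].add(word)
    return dict(adjacency)


def degree_distribution(adjacency):
    """Map each degree value to the number of nodes that have that degree."""
    return Counter(len(neighbors) for neighbors in adjacency.values())


if __name__ == "__main__":
    tokens = "the cat saw the dog".split()
    network = cooccurrence_network(tokens)
    for word, neighbors in sorted(network.items()):
        print(word, sorted(neighbors))
    print(dict(degree_distribution(network)))
```

Storing neighbors in sets keeps the network unweighted, so repeated co-occurrences of the same pair count once; a weighted variant would store counts instead.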
