Representation of texts as complex networks: a mesoscopic approach

Henrique Ferraz de Arruda,Diego Raphael Amancio,Filipi Nascimento Silva,Luciano da Fontoura Costa,Vanessa Queiroz Marinho

doi:10.1093/comnet/cnx023

Henrique Ferraz de Arruda, Diego Raphael Amancio + Show 3 more

Open Access

https://doi.org/10.1093/comnet/cnx023

Copy DOI

Abstract

Statistical techniques that analyze texts, referred to as text analytics, have departed from the use of simple word count statistics towards a new paradigm. Text mining now hinges on a more sophisticated set of methods, including the representations in terms of complex networks. While well-established word-adjacency (co-occurrence) methods successfully grasp syntactical features of written texts, they are unable to represent important aspects of textual data, such as its topical structure, i.e. the sequence of subjects developing at a mesoscopic level along the text. Such aspects are often overlooked by current methodologies. In order to grasp the mesoscopic characteristics of semantical content in written texts, we devised a network model which is able to analyze documents in a multi-scale fashion. In the proposed model, a limited amount of adjacent paragraphs are represented as nodes, which are connected whenever they share a minimum semantical content. To illustrate the capabilities of our model, we present, as a case example, a qualitative analysis of "Alice's Adventures in Wonderland". We show that the mesoscopic structure of a document, modeled as a network, reveals many semantic traits of texts. Such an approach paves the way to a myriad of semantic-based applications. In addition, our approach is illustrated in a machine learning context, in which texts are classified among real texts and randomized instances.

Full Text