Network Data Analysis of Word Graphs With Applications to Authorship Attribution

Timothy Leonard

doi:10.23860/thesis-leonard-timothy-2018

Abstract

Network data analysis is an emerging area of study that applies quantitative analysis to complex data from a variety of application fields. Methods used in network data analysis enable visualization of relational data in the form of graphs and also yield descriptive characteristics and predictive graph models. This thesis shows that a representation of text as a word graph produces the well-documented feature sets used in authorship attribution tasks, such as the word frequency model and the part-of-speech (POS) bigram model. This thesis applies nominal assortativity of parts of speech, a network data characteristic of word graphs, to the problem of authorship attribution and shows how these features are produced from a word graph model. Specifically, it is shown that the nominal assortative mixture of parts of speech, a statistic that measures the tendency of words of the same POS in a word network to be connected by an edge, produces a feature set that can be used to predict authorship. Comparison with the POS bigram model, a highly accurate authorship attribution model, shows that the nominal assortativity model is competitive. Analysis of these models along with word graph characteristics provides insights into the English language. In particular, analysis of the nominal assortative mixture of parts of speech reveals regular structural properties of English grammar.

Highlights

The representation of text as a word graph produces well-documented feature sets used in authorship attribution tasks, such as the word frequency model and the part-of-speech bigram model.
Future research could explore the relationships among natural languages using nominal assortativity, similar to the comparative work on motifs by Milo et al., who found that three-vertex motifs occurred at similar rates in different languages, such as French and Japanese.
The information contained in a word graph produces several well-documented feature sets used in authorship attribution tasks, including the part-of-speech bigram model examined in this article.

Summary

Word-Network Model

A word-network model is a directed graph G = (V, E) with a set V of vertices represented as unique words and a set E of edges, where elements of E are ordered pairs (u, v), or bigrams, of distinct words u, v ∈ V appearing consecutively within sentences in a sample text. Example text: "A fox jumped over Sir Walters the lazy dog". The representation of text as a word graph produces well-documented feature sets used in authorship attribution tasks, such as the word frequency model and the part-of-speech bigram model (see Table 1 and Table 2). This graph representation allows application of various network data analysis methods, such as the reporting of network characteristics including degree distribution, graph density, and nominal assortativity.
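The definition above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the whitespace tokenizer, and the hand-assigned POS tags in `pos_of` are all hypothetical, and the nominal assortativity coefficient is computed with Newman's general formula r = (Σ e_ii − Σ a_i b_i) / (1 − Σ a_i b_i) over the unique edges of the graph, which is assumed here to match the statistic the text refers to.

```python
from collections import Counter


def build_word_graph(sentences):
    """Build a directed word graph from an iterable of sentences.

    Returns (word_counts, edge_counts): word_counts maps each vertex (unique
    word) to its frequency, edge_counts maps each directed edge (u, v) of
    consecutive distinct words to the number of times that bigram occurs.
    """
    word_counts = Counter()
    edge_counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()  # naive tokenizer (assumption)
        word_counts.update(words)
        for u, v in zip(words, words[1:]):
            if u != v:  # the definition requires distinct words
                edge_counts[(u, v)] += 1
    return word_counts, edge_counts


def graph_density(word_counts, edge_counts):
    """Density of a directed graph without self-loops: |E| / (|V| (|V| - 1))."""
    n = len(word_counts)
    return len(edge_counts) / (n * (n - 1)) if n > 1 else 0.0


def pos_bigram_counts(edge_counts, pos_of):
    """Aggregate weighted word edges into counts of (POS, POS) bigrams."""
    counts = Counter()
    for (u, v), weight in edge_counts.items():
        counts[(pos_of[u], pos_of[v])] += weight
    return counts


def nominal_assortativity(edge_counts, pos_of):
    """Newman's nominal assortativity over the unique directed edges."""
    m = len(edge_counts)
    if m == 0:
        raise ValueError("graph has no edges")
    # mixing[(ci, cj)] = number of unique edges from category ci to category cj
    mixing = Counter((pos_of[u], pos_of[v]) for u, v in edge_counts)
    out_frac = Counter()  # a_i: fraction of edges leaving category i
    in_frac = Counter()   # b_i: fraction of edges entering category i
    for (cu, cv), k in mixing.items():
        out_frac[cu] += k / m
        in_frac[cv] += k / m
    trace = sum(k / m for (cu, cv), k in mixing.items() if cu == cv)
    expected = sum(out_frac[c] * in_frac[c] for c in out_frac)
    if expected == 1.0:
        raise ValueError("coefficient undefined: all edges join one category")
    return (trace - expected) / (1.0 - expected)


# Example text from the summary; the POS tags are hand-assigned assumptions.
sentences = ["A fox jumped over Sir Walters the lazy dog"]
pos_of = {
    "a": "DET", "fox": "NOUN", "jumped": "VERB", "over": "ADP",
    "sir": "NOUN", "walters": "NOUN", "the": "DET", "lazy": "ADJ",
    "dog": "NOUN",
}

word_counts, edge_counts = build_word_graph(sentences)
density = graph_density(word_counts, edge_counts)
bigrams = pos_bigram_counts(edge_counts, pos_of)
r = nominal_assortativity(edge_counts, pos_of)
```

For this single sentence the graph has 9 vertices and 8 edges, so the density is 8 / 72 ≈ 0.111, and under the assumed tags r = −9/47 ≈ −0.191. The `word_counts` and `bigrams` counters show how a word frequency model and a POS bigram model can both be read off the same graph; note that a conventional POS bigram count would also include immediately repeated words, which this graph excludes by definition.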

Feature Sets

Model Testing

Using Assortativity to Compare

Findings

Conclusions
