Abstract

Network data analysis is an emerging area of study that applies quantitative analysis to complex data from a variety of application fields. Methods used in network data analysis enable visualization of relational data in the form of graphs and also yield descriptive characteristics and predictive graph models. This thesis shows that a representation of text as a word graph produces the well-documented feature sets used in authorship attribution tasks, such as the word frequency model and the part-of-speech (POS) bigram model. It applies the nominal assortativity of parts of speech, a network data characteristic of word graphs, to the problem of authorship attribution and shows how these features are produced from a word graph model. Specifically, it is shown that the nominal assortative mixture of parts of speech, a statistic that measures the tendency of words of the same POS in a word network to be connected by an edge, produces a feature set that can be used to predict authorship. A comparison with the POS bigram model, a highly accurate authorship attribution model, shows that the nominal assortativity model is competitive. Analysis of these models along with word graph characteristics provides insights into the English language. In particular, analysis of the nominal assortative mixture of parts of speech reveals regular structural properties of English grammar.

Highlights

  • The representation of text as a word graph produces well-documented feature sets used in authorship attribution tasks, such as the word frequency model and the part-of-speech bigram model.

  • Future research could explore the relationships among natural languages using nominal assortativity, similar to the comparative work done on motifs by Milo et al., who found that three-vertex motifs occurred at similar rates in different languages, such as French and Japanese.

  • The information contained in a word graph produces several well-documented feature sets used in authorship attribution tasks, including the part-of-speech bigram model examined in this article.


Summary

Word-Network Model

A word-network model is a directed graph G = (V, E) with a set V of vertices representing unique words and a set E of edges, where the elements of E are ordered pairs (u, v), or bigrams, of distinct words u, v ∈ V that appear consecutively within sentences of a sample text. The representation of text as a word graph produces well-documented feature sets used in authorship attribution tasks, such as the word frequency model and the part-of-speech bigram model (see Table 1 and Table 2). This graph representation allows the application of various network data analysis methods, such as the reporting of network characteristics including degree distribution, graph density, and nominal assortativity.
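The definition above can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the whitespace tokenizer, the toy sentences, and the hand-written POS tags are all assumptions made for illustration. The assortativity function uses Newman's standard coefficient for a categorical vertex attribute, r = (Σ e_ii − Σ a_i b_i) / (1 − Σ a_i b_i), where e_ij is the fraction of edges running from category i to category j, and a_i and b_i are the fractions of edges leaving and entering category i.

```python
from collections import Counter


def build_word_network(sentences):
    """Return (vertices, edges) of a directed word graph.

    Edges are ordered pairs (u, v) of distinct words that appear
    consecutively within a sentence; bigrams never span sentences.
    Tokenization by lowercasing and whitespace splitting is an assumption.
    """
    vertices, edges = set(), set()
    for sentence in sentences:
        words = sentence.lower().split()
        vertices.update(words)
        for u, v in zip(words, words[1:]):
            if u != v:  # the definition requires distinct words
                edges.add((u, v))
    return vertices, edges


def graph_density(vertices, edges):
    """Fraction of the n * (n - 1) possible directed edges that are present."""
    n = len(vertices)
    return len(edges) / (n * (n - 1)) if n > 1 else 0.0


def nominal_assortativity(edges, label):
    """Newman's assortativity coefficient for a categorical vertex label."""
    m = len(edges)
    pair_counts = Counter((label[u], label[v]) for u, v in edges)
    a = Counter()  # fraction of edges leaving each category
    b = Counter()  # fraction of edges entering each category
    for (i, j), count in pair_counts.items():
        a[i] += count / m
        b[j] += count / m
    same = sum(c for (i, j), c in pair_counts.items() if i == j) / m
    expected = sum(a[k] * b[k] for k in a)
    if expected == 1:  # every vertex shares one category: coefficient undefined
        return float("nan")
    return (same - expected) / (1 - expected)


# Toy input; the POS tags are hand-assigned, not produced by a tagger.
sentences = ["the fox jumped over the lazy dog", "the dog slept"]
pos = {"the": "DET", "fox": "NOUN", "jumped": "VERB", "over": "ADP",
       "lazy": "ADJ", "dog": "NOUN", "slept": "VERB"}

V, E = build_word_network(sentences)
print(len(V), len(E), graph_density(V, E), nominal_assortativity(E, pos))
```

In this toy graph no edge joins two words with the same tag, so the coefficient is negative, meaning words tend to be followed by words of a different part of speech. A real experiment would replace the hand-written tags with the output of a POS tagger and compute the statistic per author sample.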

Feature Sets
Model Testing
Using Assortativity to Compare
Findings
Conclusions