Abstract

This paper investigates the relative effectiveness and accuracy of multivariate analysis, specifically cluster analysis, of the frequencies of very frequent words and the frequencies of very frequent word sequences in distinguishing texts by different authors and grouping texts by a single author. Cluster analyses based on frequent words are fairly accurate for groups of texts by known authors, whether the texts are long sections of modern British and US novels or shorter sections of contemporary literary critical texts, but they are only rarely completely accurate. When frequent word sequences are used instead of frequent words or in addition to them, however, the accuracy of the analyses often improves, sometimes dramatically, especially when personal pronouns are eliminated. Analyses based on frequent sequences even provide completely correct results in some cases where analyses based on frequent words fail. They also produce superior results for small groups of problematic novels and critical texts extracted from the larger corpora. Such successes suggest that analyses based on frequent word sequences constitute improved tools for authorship and stylistic studies.
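As a rough illustration of the kind of input such analyses work from, the following minimal sketch builds relative-frequency profiles of the most frequent words or word sequences in a small set of texts and compares two profiles with a Euclidean distance. It is not the procedure used in the paper. The function names, the pronoun list, the choice to discard any sequence containing a pronoun, and the distance measure are all assumptions made for illustration. A cluster analysis would take profiles like these as its input.

```python
import re
from collections import Counter

# Hypothetical pronoun list for illustration; the paper's own list is not given here.
PRONOUNS = {"i", "me", "my", "you", "your", "he", "him", "his", "she", "her",
            "it", "its", "we", "us", "our", "they", "them", "their"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sequence_counts(words, n=2, drop_pronouns=False):
    """Count word sequences of length n; n=1 gives plain word counts.

    Assumption: when drop_pronouns is True, any sequence that contains
    a personal pronoun is discarded.
    """
    grams = (tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    if drop_pronouns:
        grams = (g for g in grams if not PRONOUNS.intersection(g))
    return Counter(grams)


def frequency_profiles(texts, n=2, top=10, drop_pronouns=False):
    """Return the `top` most frequent sequences in the whole corpus and,
    for each text, their relative frequencies within that text."""
    counts = {name: sequence_counts(tokenize(t), n, drop_pronouns)
              for name, t in texts.items()}
    total = Counter()
    for c in counts.values():
        total.update(c)
    features = [g for g, _ in total.most_common(top)]
    profiles = {}
    for name, c in counts.items():
        size = sum(c.values()) or 1  # avoid division by zero for empty texts
        profiles[name] = [c[g] / size for g in features]
    return features, profiles


def distance(a, b):
    """Euclidean distance between two frequency profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Toy corpus: two texts in one "style" and one in another.
texts = {
    "a1": "the cat sat on the mat and the cat sat on the rug",
    "a2": "the cat sat on the bed and the cat sat on the mat",
    "b1": "of the sea and of the sky and of the land",
}
features, profiles = frequency_profiles(texts, n=2, top=5)
```

On this toy corpus the profiles of a1 and a2 lie closer to each other than either does to b1. Setting n=1 switches from word sequences to single words, and drop_pronouns=True applies the assumed pronoun filter.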
