Abstract

Principal component analysis for authorship attribution Background: To recognize the authors of the texts by the use of statistical tools, one first needs to decide about the features to be used as author characteristics, and then extract these features from texts. The features extracted from texts are mostly the counts of so called function words. Objectives: The data extracted are processed further to compress as a data with less number of features, such a way that the compressed data still has the power of effective discriminators. In this case feature space has less dimensionality then the text itself. Methods/Approach: In this paper, the data collected by counting words and characters in around a thousand paragraphs of each sample book, underwent a principal component analysis performed using neural networks. Once the analysis was complete, the first of the principal components is used to distinguish the books authored by a certain author. Results: The achieved results show that every author leaves a unique signature in written text that can be discovered by analyzing counts of short words per paragraph. Conclusions: In this article we have demonstrated that based on analyzing counts of short words per paragraph authorship could be traced using principal component analysis. Methodology could be used for other purposes, like fraud detection in auditing.

Highlights

  • The author identification problem is the oldest of the all text classification problems, it is still not well organized

  • First the function words are counted from selected text, this data is transferred into principal components world

  • In this article we have proposed a method of authorship detection using principal component analysis for texts written in Bosnian language, which has proven as an efficient tool

Read more

Summary

Introduction

The author identification problem is the oldest of the all text classification problems, it is still not well organized Throughout its history, it is mishandled by statisticians. Objectives: The data extracted are processed further to compress as a data with less number of features, such a way that the compressed data still has the power of effective discriminators. In this case feature space has less dimensionality the text itself. Conclusions: In this article we have demonstrated that based on analyzing counts of short words per paragraph authorship could be traced using principal component analysis.

Objectives
Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call