A language-independent authorship attribution approach for author identification of text documents

Reza Ramezani

doi:10.1016/j.eswa.2021.115139

Abstract

In the Authorship Attribution (AA) task, the most likely author of a textual document, such as a book, paper, news article, text message, or post, is identified using statistical and computational methods. In this paper, a new computational approach is presented for identifying the most likely author of text documents. The proposed solution emphasizes lazy profile-based classification and, by using the Term Frequency-Inverse Document Frequency (TF-IDF) scheme, introduces a new measure for identifying the important terms of documents. The importance of the terms is then used to calculate the similarity between an anonymous document and known documents. The proposed solution works with raw text documents and does not require any NLP tools for preprocessing, which makes it language-independent. The efficiency of the proposed solution has been evaluated by conducting several experiments on two datasets, one English and one Persian, each of which contains six corpora with different numbers of authors. The obtained results demonstrate that the proposed solution outperforms state-of-the-art stylometric features, employed by seven well-known classifiers, obtaining an accuracy of 0.902 on the English dataset and 0.931 on the Persian dataset. In addition, supplementary experiments have been conducted to evaluate the effect of document length on the accuracy of the proposed solution, to examine the computation time of the proposed solution and the competing classifiers, and to identify the most effective stylometric features and classifiers.
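
To make the general idea concrete, the following is a minimal sketch of lazy, profile-based attribution with standard TF-IDF weights and cosine similarity. It is not the measure introduced in this paper: the smoothed IDF formula, the cosine scoring, and every function name (`tokenize`, `build_profiles`, `idf_weights`, `tfidf_vector`, `cosine`, `attribute`) are assumptions chosen for illustration. Tokenization is plain whitespace splitting, so no language-specific NLP tool is involved, and nothing is trained ahead of time: all computation happens when an anonymous document arrives.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace splitting only: no stemmer, tagger, or stop-word list.
    return text.lower().split()


def build_profiles(known_docs):
    """Merge each author's known documents into one bag of term counts."""
    profiles = {}
    for author, text in known_docs:
        profiles.setdefault(author, Counter()).update(tokenize(text))
    return profiles


def idf_weights(profiles):
    # Smoothed IDF over author profiles (an assumed variant):
    # terms used by fewer authors receive a larger weight.
    n = len(profiles)
    df = Counter()
    for counts in profiles.values():
        df.update(counts.keys())
    return {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}


def tfidf_vector(counts, idf):
    # Relative term frequency times IDF; terms unseen in any profile are dropped.
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def attribute(anonymous_text, known_docs):
    """Return the best-scoring author and the score of every candidate."""
    profiles = build_profiles(known_docs)
    idf = idf_weights(profiles)
    query = tfidf_vector(Counter(tokenize(anonymous_text)), idf)
    scores = {a: cosine(query, tfidf_vector(c, idf)) for a, c in profiles.items()}
    return max(scores, key=scores.get), scores
```

Calling `attribute(text, [(author, document), ...])` returns the highest-scoring author together with the per-author similarity scores, which allows the margin between candidates to be inspected.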

Full Text