Abstract

Dealing with imbalanced data remains a major challenge in data mining and machine learning. In this investigation, we address the problem of class imbalance in the Authorship Attribution (AA) task, with a specific application to Arabic text data. This article proposes a new hybrid approach based on Principal Component Analysis (PCA) and the Synthetic Minority Over-sampling Technique (SMOTE), which considerably improves the performance of authorship attribution on imbalanced data. The dataset contains 7 Arabic books written by 7 different scholars, which are divided into text segments of the same size, with an average length of 2900 words per segment. The experimental results show that the proposed approach, using the SMO-SVM classifier, achieves high authorship attribution accuracy (100%), especially with starting character-bigrams. In addition, the proposed method improves AA performance on imbalanced datasets, mainly with function words.
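
To make the oversampling idea concrete, the following is a minimal sketch of SMOTE-style interpolation in plain Python. It is not the authors' implementation: the function name `smote_oversample`, the Euclidean distance, the neighbour count `k`, and the toy data are all assumptions made for illustration. The PCA step is omitted, since the abstract does not state how the two techniques are combined.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration only; not the authors' code.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: four feature vectors (assumed data, not from the paper).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_oversample(minority, n_new=6)
```

Each synthetic vector lies on a line segment between two existing minority samples, so the minority class grows without duplicating any original text segment exactly.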

Highlights

  • Authorship attribution (AA) is one of the earliest research fields of computational linguistics and has a long history in identifying disputed or unknown authors (Mosteller & Wallace, 1984)

  • This article proposes a new hybrid approach based on principal component analysis (PCA) and the synthetic minority oversampling technique (SMOTE), which considerably improves the performance of authorship attribution on imbalanced data

  • The experimental results show that the proposed approach, using the Sequential Minimal Optimization-based Support Vector Machine (SMO-SVM) classifier, achieves high authorship attribution accuracy (100%), especially with starting character-bigrams
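
As an illustration of the feature type mentioned above, here is a small sketch that builds a relative-frequency profile of starting character-bigrams. It assumes that a "starting character-bigram" is the first two characters of each word and that words are separated by whitespace; the source does not define either. The function name `starting_bigram_profile` is hypothetical, and an English toy string is used only for readability.

```python
from collections import Counter


def starting_bigram_profile(text):
    """Relative frequency of the first two characters of each word."""
    # Whitespace tokenisation is an assumption; words shorter than
    # two characters are skipped.
    bigrams = [word[:2] for word in text.split() if len(word) >= 2]
    counts = Counter(bigrams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: count / total for bigram, count in counts.items()}


profile = starting_bigram_profile("the theory of these things")
```

Python slices strings by Unicode code point, so the same function runs on Arabic text. Any normalisation, such as diacritic removal, would be a separate preprocessing decision that this sketch does not cover.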

Introduction

Authorship attribution (AA) is one of the earliest research fields of computational linguistics and has a long history in identifying disputed or unknown authors (Mosteller & Wallace, 1984). AA can be used to identify document sources (Li et al., 2013), resolve disputed authorship (Eder, 2015), detect plagiarism in student essays (AlSallal et al., 2019), etc. A suitable set of features is extracted and combined with a reliable classification technique to find the right author. In this regard, function words (stop words) and spelling errors should be kept, because they have a substantial
