Arabic Authorship Attribution Using Synthetic Minority Over-Sampling Technique and Principal Components Analysis for Imbalanced Documents

Hassina Hadjadj, Halim Sayoud

doi:10.4018/ijcini.20211001.oa33

Abstract

Dealing with imbalanced data is a major challenge in data mining and machine learning. This investigation addresses the problem of class imbalance in the Authorship Attribution (AA) task, applied specifically to Arabic text data. The article proposes a new hybrid approach based on Principal Components Analysis (PCA) and the Synthetic Minority Over-sampling Technique (SMOTE), which considerably improves the performance of authorship attribution on imbalanced data. The dataset contains 7 Arabic books written by 7 different scholars, which are segmented into text segments of the same size, with an average length of 2900 words per text. The experimental results show that the proposed approach, using the SMO-SVM classifier, achieves high authorship attribution accuracy (100%), especially with starting character-bigrams. In addition, the proposed method improves AA performance on imbalanced datasets, mainly with function words.
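The oversampling idea behind SMOTE can be illustrated with a minimal sketch. This is not the authors' implementation: it is a plain-Python rendering of the general SMOTE interpolation step, and the function name `smote`, its parameters, and the toy data are assumptions made for illustration. PCA and the SMO-SVM classifier are omitted, and the order in which the paper combines PCA with SMOTE is not stated in this excerpt.

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples for a minority class.

    Each synthetic sample lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    (Hypothetical helper for illustration only.)
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Indices of the k nearest *other* minority samples (Euclidean distance).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: dist(x, minority[j]),
        )[:k]
        nb = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Toy feature vectors for an under-represented author (assumed data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority, n_new=4, k=2)
```

In a setting like the one described, the rows of `minority` would be feature vectors of text segments from an author with few segments, and the synthetic rows would be added to the training set only, never to the test set.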

Highlights

Authorship attribution (AA) is one of the earliest research fields of computational linguistics and has a long history in identifying disputed or unknown authors (Mosteller & Wallace, 1984)
This article proposes a new hybrid approach based on principal components analysis (PCA) and the synthetic minority oversampling technique (SMOTE), which considerably improves the performance of authorship attribution on imbalanced data
The experimental results show that the proposed approach, using the Sequential Minimal Optimization-based Support Vector Machine (SMO-SVM) classifier, achieves high authorship attribution accuracy (100%), especially with starting character-bigrams
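As a rough illustration of the "starting character-bigrams" feature mentioned above, the sketch below assumes the term means the first two characters of each word; the excerpt does not define it, so this reading, the whitespace tokenization, and the function names are all assumptions rather than the authors' procedure.

```python
from collections import Counter


def starting_bigrams(text):
    """Return the first two characters of every word that has at least two.

    Assumes whitespace tokenization and text without diacritics, since
    combining marks would count as separate characters.
    """
    return [word[:2] for word in text.split() if len(word) >= 2]


def bigram_profile(text):
    """Relative frequency of each starting bigram in one text segment."""
    counts = Counter(starting_bigrams(text))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: n / total for bigram, n in counts.items()}


# Toy segment (assumed data); Arabic words are sliced by code point the same way.
profile = bigram_profile("the theory then holds")
```

Each text segment would yield one such profile, and the profiles, aligned over a shared bigram vocabulary, would form the feature vectors passed to the later stages.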

Summary

Introduction

Authorship attribution (AA) is one of the earliest research fields of computational linguistics and has a long history in identifying disputed or unknown authors (Mosteller & Wallace, 1984). AA can be used for identifying document sources (Li et al., 2013), resolving disputed authorship (Eder, 2015), detecting plagiarism in student essays (AlSallal et al., 2019), etc. A suitable set of features is extracted and combined with a reliable classification technique to find the right author. In this regard, function words (stop words) and spelling errors should be kept, because they have a substantial



Journal: International Journal of Cognitive Informatics and Natural Intelligence
Publication Date: Jul 29, 2021
Citations: 3
License type: CC BY 3.0


Similar Papers

Arabic Authorship Attribution Using Synthetic Minority Over-sampling Technique and Principal Components Analysis for Imbalanced Documents
International Journal of Cognitive Informatics and Natural Intelligence | VOL. 15
01 Oct 2021

An Investigation of SMOTE based Methods for Imbalanced Datasets with Data Complexity Analysis
Nur Athirah Azhar ... Aniza Mohamed Din
IEEE Transactions on Knowledge and Data Engineering
01 Jan 2021

Automated semiconductor wafer defect classification dealing with imbalanced data
Po-Hsuan Lee ... Ofer Adan
20 Mar 2020

Whale Optimization-based Synthetic Minority Oversampling Technique for Binary Imbalanced Datasets
Pooja Tyagi ... Anjana Gosain
Procedia Computer Science | VOL. 235
01 Jan 2024
