Incorporating Topic Information in a Global Feature Selection Schema for Authorship Attribution

Hayri Volkan Agun, Ozgur Yilmazel

doi:10.1109/access.2019.2930536

Abstract

Authorship attribution (AA) is the stylometric analysis task of finding the author of an anonymous or disputed text document. In AA, class-based feature selection schemas, such as Chi-square and the Gini index, have been shown to offer only limited performance improvement over frequency-based feature selection schemas, such as document frequency, common n-grams, and inverted document frequency. The feature selection process in AA is also significantly affected by topic distributions. In this paper, we assess the performance of a global feature selection approach that incorporates the document’s topic category to scale the existing feature weights. In this approach, features that an author uses in common across different topics indicate higher relevance for the author and thus receive higher weights. Features with biased topic distributions, on the other hand, are assumed to have high topic relevance and receive lower weights. The global topic measure and the author-specific topic measure are combined to scale the existing selection weights of the features. Ten-fold cross-validation experiments on a multi-topic dataset with a random topic distribution indicate that our approach significantly improves the performance of the Chi-square, modified Gini index, and common n-grams schemas in the best-performing classifier configurations.
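The scaling idea in the abstract can be illustrated with a small sketch. The abstract does not give the formulas for the global and author-specific topic measures, so the code below assumes both are the normalized entropy of a feature's occurrence counts across topics and assumes they are combined by multiplication. The function names `normalized_entropy` and `topic_scaled_weight` are hypothetical and are not taken from the paper.

```python
import math


def normalized_entropy(counts):
    """Entropy of a count distribution scaled to [0, 1].

    1.0 means the feature is spread evenly over all topics;
    0.0 means it occurs in a single topic only.
    ASSUMPTION: entropy stands in for the paper's topic measures.
    """
    total = sum(counts)
    k = len(counts)
    if total == 0 or k < 2:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    h = sum(p * math.log(1.0 / p) for p in probs)
    return h / math.log(k)


def topic_scaled_weight(base_weight, global_topic_counts, author_topic_counts):
    """Scale an existing selection weight (e.g. a Chi-square score).

    ASSUMPTION: the global and author-specific measures are multiplied;
    the paper may combine them differently.
    """
    global_measure = normalized_entropy(global_topic_counts)
    author_measure = normalized_entropy(author_topic_counts)
    return base_weight * global_measure * author_measure


# A feature used evenly across three topics keeps most of its weight,
# while a feature concentrated in one topic is pushed toward zero.
even = topic_scaled_weight(2.0, [5, 5, 5], [2, 2, 2])
biased = topic_scaled_weight(2.0, [14, 1, 0], [6, 0, 0])
print(even > biased)
```

Under these assumptions, a topic-neutral feature is favoured over a topic-bound one even when both start from the same base score, which is the qualitative behaviour the abstract describes.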

Highlights

The task of authorship attribution (AA) is the identification of the author of a disputed/unknown text document.
Function words, a well-known feature set in AA, have high document frequencies. When the inverted document frequency (IDF) selection schema is applied to arbitrary words, most function words therefore receive low scores and are eliminated.
Modern feature selection schemas for text classification have been evaluated on content-dependent tasks, where the document content and the target label are directly related.
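The effect of IDF on function words can be seen on a toy corpus. This is a minimal sketch that assumes the plain, unsmoothed variant `log(N / df)`; the paper may use a different IDF formulation, and the four example documents are invented for illustration.

```python
import math

# Invented toy corpus: "the" and "of" occur in every document.
docs = [
    "the goal of the striker",
    "the vote of the senate",
    "the plot of the film",
    "the budget of the senate",
]


def document_frequency(documents):
    """Count, for each term, the number of documents that contain it."""
    df = {}
    for doc in documents:
        for term in set(doc.split()):
            df[term] = df.get(term, 0) + 1
    return df


def idf_scores(documents):
    """Unsmoothed IDF: log(N / df). ASSUMPTION: not necessarily the paper's variant."""
    n = len(documents)
    return {term: math.log(n / df) for term, df in document_frequency(documents).items()}


scores = idf_scores(docs)
# Keep the six highest-scoring terms, as an IDF-based selector would.
selected = sorted(scores, key=scores.get, reverse=True)[:6]
print(sorted(selected))
```

Because "the" and "of" appear in all four documents, their IDF is `log(4 / 4) = 0`, so they rank last and are dropped, while topic-bearing words such as "striker" or "film" are kept.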

Summary

INTRODUCTION

The task of authorship attribution (AA) is the identification of the author of a disputed/unknown text document. Feature sets suggested for exploiting the stylometric properties of the authors are generally assumed to be topic independent, and they encode little or no information about the content of the document. Recent studies describe these feature sets as vocabulary richness, readability measures, character n-grams, terms, and function words [5]–[8]. Odds ratio and chi-square have been compared on datasets with very few authors. According to these comparisons, simple document frequency (DF) based term selection has been reported to be quite competitive with other feature selection methods [9], [10].
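To make the contrast between class-based and frequency-based selection concrete, the sketch below scores a term with the standard chi-square statistic for a 2x2 term/author contingency table and with plain document frequency. It uses the textbook formula; the summary does not show the exact variant used in the paper, and the argument names `a`, `b`, `c`, `d` are chosen here for illustration.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/author contingency table.

    a: documents by the author that contain the term
    b: documents by other authors that contain the term
    c: documents by the author that lack the term
    d: documents by other authors that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A term present in all or in no documents carries no class signal.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def document_frequency(a, b):
    """DF ignores the author label and only counts documents with the term."""
    return a + b


# A term found only in one author's documents: strong class signal.
print(chi_square(10, 0, 0, 10), document_frequency(10, 0))
# A term spread evenly over both groups: same DF, no class signal.
print(chi_square(5, 5, 5, 5), document_frequency(5, 5))
```

Both terms have the same document frequency, so a DF-based selector ranks them equally, whereas chi-square separates them by how strongly they depend on the author.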

FEATURE SELECTION METHODS

INVERTED DOCUMENT FREQUENCY

CHI-SQUARE

AUTHOR SPECIFIC TOPIC MEASURE

EXPERIMENTS

EVALUATION

RESULTS

Findings

CONCLUSIONS

Journal: IEEE Access: Practical Innovations, Open Solutions
Publication Date: Jan 1, 2019
Citations: 35
License type: CC BY 4.0
