Abstract

Text classification methods have typically been evaluated on topic classification tasks. This thesis extends the empirical evaluation to emotion classification tasks in the literary domain. It selects two literary text classification problems---eroticism classification in Dickinson's poems and sentimentalism classification in early American novels---as cases for this evaluation. Both problems focus on identifying certain kinds of emotion, a document property other than topic. The study chooses two popular text classification algorithms---naive Bayes and Support Vector Machines (SVM)---and three feature engineering options---stemming, stopword removal, and statistical feature selection (Odds Ratio and SVM)---as the subjects of evaluation. It aims to examine the effects of the chosen classifiers and feature engineering options on the two emotion classification problems, as well as the interaction between the classifiers and the feature engineering options. The thesis seeks empirical answers to the following research questions: (1) Is SVM a better classifier than naive Bayes with respect to classification accuracy, new literary knowledge discovery, and potential for example-based retrieval? (2) Is SVM a better feature selection method than Odds Ratio with respect to feature reduction rate and classification accuracy improvement? (3) Does stopword removal affect classification performance? (4) Does stemming affect the performance of classifiers and feature selection methods? Some of our conclusions are consistent with those obtained in topic classification, such as the findings that Odds Ratio does not improve SVM performance and that stopword removal might harm classification. Some conclusions contradict previous results: for example, SVM does not beat naive Bayes in either case. Some findings are new to this area: SVM and naive Bayes select top features in different frequency ranges, and stemming might harm feature selection methods.
These experimental results provide new insights into the relationship between classification methods, feature engineering options, and non-topic document properties. They also provide guidance for classification method selection in literary text classification applications.
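The Odds Ratio criterion named in the abstract scores each term by how strongly its presence favors the positive class. As an illustrative sketch only (the smoothing scheme and document-frequency estimates are assumptions, not the thesis's exact procedure), it can be computed like this:

```python
def odds_ratio(docs, labels, positive):
    """Score each vocabulary term by Odds Ratio for the positive class.

    OR(t) = [P(t|pos) * (1 - P(t|neg))] / [(1 - P(t|pos)) * P(t|neg)]

    docs:   list of token lists (one per document)
    labels: parallel list of class labels
    Probabilities are document-frequency estimates with Laplace
    smoothing, so the ratio is always defined.
    """
    pos_docs = [set(d) for d, y in zip(docs, labels) if y == positive]
    neg_docs = [set(d) for d, y in zip(docs, labels) if y != positive]
    vocab = set().union(*pos_docs, *neg_docs)
    scores = {}
    for t in vocab:
        # Smoothed probability that a document of each class contains t
        p_pos = (sum(t in d for d in pos_docs) + 1) / (len(pos_docs) + 2)
        p_neg = (sum(t in d for d in neg_docs) + 1) / (len(neg_docs) + 2)
        scores[t] = (p_pos * (1 - p_neg)) / ((1 - p_pos) * p_neg)
    return scores
```

Terms concentrated in the positive class score above 1, terms concentrated in the negative class below 1; feature selection keeps the top-scoring terms.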

Highlights

  • Text classification is a typical scholarly activity in literary study (Unsworth, 2000; Yu and Unsworth, 2006)

  • For decades computational analysis tools have been used in some literary text classification tasks, such as authorship attribution (Mosteller and Wallace, 1964; Holmes, 1994) and stylistic analysis (Holmes, 1998)

  • Given the unique characteristics of literary text classification applications, we must ask whether the existing conclusions on classification method comparison still hold for literary text classification tasks


Summary

Introduction

Dickinson’s poems (Plaisant et al, 2006), and naïve Bayes classification for sentimentalism analysis of early American novels (Horton et al, 2006). These benchmark data sets were limited to news and web documents, which have different characteristics from the creative writings in literature. In these evaluation studies, all methods were tested on topic classification tasks. Sometimes scholars would like to have classifiers as example-based retrieval tools to find more documents of a certain kind, such as ekphrastic poems and historicist catalog poems (Yu and Unsworth, 2006). In these cases, only a small number of training examples are available, which requires the classifiers to learn fast and accurately. Because no benchmark data is available in this domain, the methods are compared on two specific sub-genre classification tasks as case studies, both focusing on identifying certain kinds of emotion, a document property other than topic. Among them, ninety-five chapters were labeled as ‘high’ and eighty-nine as ‘low’.
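The example-based retrieval use mentioned above, ranking unlabeled documents by a classifier trained on a handful of examples, might be sketched with a multinomial naïve Bayes log-odds scorer. This is an illustration under assumed bag-of-words inputs, not the thesis's implementation:

```python
from collections import Counter
import math

def nb_retrieval_scores(train_docs, train_labels, target, candidates):
    """Rank candidate documents by multinomial naive Bayes log-odds
    for the target class, so the scholar can review the most likely
    matches first. Documents are token lists; Laplace smoothing keeps
    unseen terms from zeroing out a score.
    """
    vocab = {t for d in train_docs for t in d} | {t for d in candidates for t in d}
    pos = Counter(t for d, y in zip(train_docs, train_labels) if y == target for t in d)
    neg = Counter(t for d, y in zip(train_docs, train_labels) if y != target for t in d)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    v = len(vocab)

    def log_odds(doc):
        # Sum of per-token log likelihood ratios, smoothed
        s = 0.0
        for t in doc:
            s += math.log((pos[t] + 1) / (n_pos + v))
            s -= math.log((neg[t] + 1) / (n_neg + v))
        return s

    return sorted(candidates, key=log_odds, reverse=True)
```

With only a few labeled examples per class, a scorer of this shape already produces a usable ranking, which is why fast, accurate learning from small training sets matters in this setting.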

Evaluation of text classification methods
Naïve Bayes and SVM classifiers
Stemming
The role of stopwords
Statistical feature selection
Classification evaluation methods
Experiment 1: document representation model selection
Experiment 2: using stopwords as feature sets
Experiment 3: stemming
Experiment 4: statistical feature selection
Experiment 5: learning curve and confidence curve
The Dickinson Erotic Poem Classification
The text representation model selection
Stopword features
Feature weights
Learning curve and confidence curve
Text representation model selection
Learning curves and confidence curves
Conclusion