Abstract

Text classification is important for better understanding online media. A major obstacle to creating accurate text classifiers with machine learning is the small size of training sets, a consequence of the cost of annotating them. We therefore investigated how SVM and NBSVM text classifiers should be designed to achieve high accuracy and how training sets should be sized to use annotation labor efficiently. We used a four-way repeated-measures full-factorial design of 32 design factor combinations. For each design factor combination, 22 training set sizes were examined; these training sets were subsets of seven public text datasets. We studied the statistical variance of accuracy estimates by repeatedly drawing new random training sets, resulting in accuracy estimates for 98,560 experimental runs. Our major contribution is a set of empirically evaluated guidelines for creating online media text classifiers using small training sets. We recommend uni- and bi-gram features as text representation, btc term weighting, and a linear-kernel NBSVM. Our results suggest that high classification accuracy can be achieved with a manually annotated dataset of only 300 examples.
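
To make the recommended configuration concrete, the sketch below implements an NBSVM in the style of Wang and Manning (2012): naive Bayes log-count ratios are computed from binary uni- and bi-gram features and used to scale the input of a linear SVM. It is a minimal illustration, not the authors' implementation; the toy texts, the smoothing parameter alpha, and the regularization parameter C are placeholders, and plain binary counts stand in for the full btc weighting.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    # Hypothetical toy corpus; a real study would use an annotated dataset.
    texts = ["great movie", "terrible plot", "loved it", "waste of time"]
    labels = np.array([1, 0, 1, 0])

    # Uni- and bi-gram features with binary term frequency.
    vec = CountVectorizer(ngram_range=(1, 2), binary=True)
    X = vec.fit_transform(texts).toarray().astype(float)

    # Naive Bayes log-count ratio with add-alpha smoothing.
    alpha = 1.0
    p = alpha + X[labels == 1].sum(axis=0)
    q = alpha + X[labels == 0].sum(axis=0)
    r = np.log((p / p.sum()) / (q / q.sum()))

    # Scale the features by r and train a linear-kernel SVM on the result.
    clf = LinearSVC(C=1.0)
    clf.fit(X * r, labels)

    print(clf.predict(vec.transform(["great plot"]).toarray() * r))

If the btc scheme is wanted more faithfully (binary term frequency, idf weighting, and cosine normalization in SMART notation), scikit-learn's TfidfVectorizer(binary=True, use_idf=True, norm="l2") approximates it, up to scikit-learn's smoothed idf formula.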

Highlights

  • To study online media content, researchers use methods of text classification to analyze large volumes of text data [1]

  • We defined the following groups of training set sizes: small training sets consist of 50–500 training examples, large training sets consist of 550–1000 examples, and training set sizes of 2000 and 10,000 examples are grouped individually

  • This paper reports an experimental study examining the design factors that affect the accuracy of machine learning text classifiers trained on small, manually annotated datasets

Introduction

To study online media content, researchers use methods of text classification to analyze large volumes of text data [1]. Text classifiers based on supervised machine learning can be adapted to new classes and texts without modifying the algorithm; they require only an annotated training dataset [2]. Such training datasets are often not available for a given class or topic of interest, so a custom dataset must be annotated manually. The resulting classifier's accuracy should increase with every additional text sample [3]. Statistically, however, each additional text increases accuracy less than the previously added one, because of the asymptotic shape of the learning curve [4]. To minimize human annotation effort, an optimal-sized training set is sought that provides the best trade-off between classification accuracy and annotation effort.
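
The shape of this trade-off can be measured empirically by redrawing random training subsets of increasing size and recording test accuracy, which is essentially what the study does at scale. The sketch below is a simplified stand-in for that protocol, not the authors' code; the 20 Newsgroups categories, the training-set sizes, and the five repetitions per size are illustrative choices.

    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.svm import LinearSVC

    # Two-class subset of 20 Newsgroups as a stand-in for the paper's corpora.
    cats = ["rec.autos", "sci.med"]
    train = fetch_20newsgroups(subset="train", categories=cats)
    test = fetch_20newsgroups(subset="test", categories=cats)

    vec = TfidfVectorizer(ngram_range=(1, 2), binary=True, norm="l2")
    X_train = vec.fit_transform(train.data)
    X_test = vec.transform(test.data)

    rng = np.random.default_rng(0)
    for n in [50, 100, 300, 500, 1000]:  # training-set sizes to probe
        scores = []
        for _ in range(5):  # redraw the training set to estimate variance
            idx = rng.choice(X_train.shape[0], size=n, replace=False)
            clf = LinearSVC().fit(X_train[idx], train.target[idx])
            scores.append(accuracy_score(test.target, clf.predict(X_test)))
        print(n, round(float(np.mean(scores)), 3), round(float(np.std(scores)), 3))

In such runs, mean accuracy typically rises steeply at first and then flattens while the spread across redraws shrinks, which is the asymptotic learning-curve behavior described above.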
