Abstract

Text classification is important for better understanding online media. A major obstacle to creating accurate text classifiers with machine learning is the small size of training sets, a consequence of the cost of annotating them. We therefore investigated how SVM and NBSVM text classifiers should be designed to achieve high accuracy and how training sets should be sized to use annotation labor efficiently. We used a four-way repeated-measures full-factorial design of 32 design factor combinations. For each design factor combination, 22 training set sizes were examined; these training sets were subsets of seven public text datasets. We studied the statistical variance of accuracy estimates by repeatedly drawing new random training sets, resulting in accuracy estimates for 98,560 experimental runs. Our major contribution is a set of empirically evaluated guidelines for creating online media text classifiers using small training sets. We recommend uni- and bi-gram features as text representation, btc term weighting, and a linear-kernel NBSVM. Our results suggest that high classification accuracy can be achieved with a manually annotated dataset of only 300 examples.
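
To make the recommended configuration concrete, the sketch below implements an NBSVM in the style of Wang and Manning (2012): naive Bayes log-count ratios are computed from binary uni- and bi-gram features and used to scale the input of a linear SVM. It is a minimal illustration, not the authors' implementation; the toy texts, the smoothing parameter alpha, and the regularization parameter C are placeholders, and plain binary counts stand in for the full btc weighting.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    # Hypothetical toy corpus; a real study would use an annotated dataset.
    texts = ["great movie", "terrible plot", "loved it", "waste of time"]
    labels = np.array([1, 0, 1, 0])

    # Uni- and bi-gram features with binary term frequency.
    vec = CountVectorizer(ngram_range=(1, 2), binary=True)
    X = vec.fit_transform(texts).toarray().astype(float)

    # Naive Bayes log-count ratio with add-alpha smoothing.
    alpha = 1.0
    p = alpha + X[labels == 1].sum(axis=0)
    q = alpha + X[labels == 0].sum(axis=0)
    r = np.log((p / p.sum()) / (q / q.sum()))

    # Scale the features by r and train a linear-kernel SVM on the result.
    clf = LinearSVC(C=1.0)
    clf.fit(X * r, labels)

    print(clf.predict(vec.transform(["great plot"]).toarray() * r))

If the btc scheme is wanted more faithfully (binary term frequency, idf weighting, and cosine normalization in SMART notation), scikit-learn's TfidfVectorizer(binary=True, use_idf=True, norm="l2") approximates it, up to scikit-learn's smoothed idf formula.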

Highlights

  • To study online media content, researchers use methods of text classification to analyze large volumes of text data [1]

  • We defined the following groups of training set sizes: small training sets consist of 50–500 training examples, large training sets consist of 550–1000 examples, and training set sizes of 2000 and 10,000 examples are grouped individually

  • This paper reports an experimental study examining the design factors that affect the accuracy of machine learning text classifiers trained on small, manually annotated datasets

Introduction

To study online media content, researchers use methods of text classification to analyze large volumes of text data [1]. Text classifiers based on supervised machine learning can be adapted to new classes and texts without modifying the algorithm; they require only an annotated training dataset [2]. Such training datasets are often not available for a given class or topic of interest, so a custom dataset must be annotated manually. The resulting classifier's accuracy should increase with every additional text sample [3]. Statistically, however, each additional text increases accuracy less than the previously added one, because of the asymptotic shape of the learning curve [4]. To minimize human annotation effort, an optimal-sized training set is sought that provides the best trade-off between classification accuracy and annotation effort.
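
The shape of this trade-off can be measured empirically by redrawing random training subsets of increasing size and recording test accuracy, which is essentially what the study does at scale. The sketch below is a simplified stand-in for that protocol, not the authors' code; the 20 Newsgroups categories, the training-set sizes, and the five repetitions per size are illustrative choices.

    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.svm import LinearSVC

    # Two-class subset of 20 Newsgroups as a stand-in for the paper's corpora.
    cats = ["rec.autos", "sci.med"]
    train = fetch_20newsgroups(subset="train", categories=cats)
    test = fetch_20newsgroups(subset="test", categories=cats)

    vec = TfidfVectorizer(ngram_range=(1, 2), binary=True, norm="l2")
    X_train = vec.fit_transform(train.data)
    X_test = vec.transform(test.data)

    rng = np.random.default_rng(0)
    for n in [50, 100, 300, 500, 1000]:  # training-set sizes to probe
        scores = []
        for _ in range(5):  # redraw the training set to estimate variance
            idx = rng.choice(X_train.shape[0], size=n, replace=False)
            clf = LinearSVC().fit(X_train[idx], train.target[idx])
            scores.append(accuracy_score(test.target, clf.predict(X_test)))
        print(n, round(float(np.mean(scores)), 3), round(float(np.std(scores)), 3))

In such runs, mean accuracy typically rises steeply at first and then flattens while the spread across redraws shrinks, which is the asymptotic learning-curve behavior described above.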
