Comparison of Supervised Classification Models on Textual Data

Bi-Min Hsu

doi:10.3390/math8050851

Abstract

Text classification is essential to many applications, such as spam detection and sentiment analysis. With the growing number of textual documents and datasets generated through social media and news articles, an increasing number of machine learning methods are required for accurate textual classification. This paper presents a comprehensive evaluation of multiple supervised learning models, namely logistic regression (LR), decision trees (DT), support vector machine (SVM), AdaBoost (AB), random forest (RF), multinomial naive Bayes (NB), multilayer perceptrons (MLP), and gradient boosting (GB), to assess their efficiency, robustness, and limitations on the classification of textual data. SVM, LR, and MLP performed better in general, with SVM being the best, while DT and AB had much lower accuracies than the other tested models. Further exploration of different SVM kernels demonstrated the advantage of linear kernels over polynomial, sigmoid, and radial basis function kernels for text classification. The effects of removing stop words on model performance were also investigated; DT performed better with stop words removed, while all other models were relatively unaffected by the presence or absence of stop words.
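To make the stop-word experiment concrete, the following is a minimal sketch of one of the listed model families, multinomial naive Bayes, trained on a bag-of-words representation with and without stop-word removal. It is not the paper's implementation: the toy corpus `TRAIN`, the `STOP_WORDS` list, and the helper names `tokenize`, `train_nb`, and `predict_nb` are all assumptions introduced here for illustration, and add-one (Laplace) smoothing is an assumed choice.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stop-word list and toy corpus (not from the paper).
STOP_WORDS = {"the", "a", "an", "is", "was", "this", "it", "and", "of", "to"}

TRAIN = [
    ("this movie was a great and moving story", "pos"),
    ("a great cast and a wonderful script", "pos"),
    ("the plot was boring and the acting was awful", "neg"),
    ("this is an awful and boring film", "neg"),
]


def tokenize(text, remove_stop_words=False):
    """Lowercase, split on whitespace, and optionally drop stop words."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def train_nb(examples, remove_stop_words=False):
    """Collect the counts a multinomial naive Bayes classifier needs."""
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # per-class token counts
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for tok in tokenize(text, remove_stop_words):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab


def predict_nb(model, text, remove_stop_words=False):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        # Log prior from class frequencies in the training set.
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text, remove_stop_words):
            if tok not in vocab:
                continue  # tokens never seen in training are ignored
            count = word_counts[label][tok]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    for flag in (False, True):
        model = train_nb(TRAIN, remove_stop_words=flag)
        for sentence in ("a great story", "boring and awful"):
            print(flag, sentence, predict_nb(model, sentence, flag))
```

On this toy corpus the predictions are the same with and without stop-word removal, which only illustrates the mechanics; it says nothing about the accuracies reported in the paper, which were measured on real datasets.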

Highlights

Classification, the process of assigning categories to data according to their content, plays an essential role in data analysis and management
All models performed significantly better on the Internet Movie Database (IMDb) dataset, suggesting that binary-class classification is easier than multi-class classification
The investigation of eight supervised learning models on text classification datasets demonstrated the dependence of model performance on the dataset size and the proportionality of classes

Summary

Introduction

Classification, the process of assigning categories to data according to their content, plays an essential role in data analysis and management. Many text classification methods exist, such as the extended stochastic complexity (ESC) stochastic decision list proposed by Li and Yamanishi, which uses if-else rules to categorize text [3]. These methods require deep knowledge of the domain of the text, and, since text does not take on a precise structure, the task of analyzing and interpreting text is extremely complex without the use of machine learning approaches.

Journal: Mathematics
Publication Date: May 24, 2020
Citations: 31
License type: CC BY 4.0

Similar Papers

Ensemble Learners of Multiple Deep CNNs for Pulmonary Nodules Classification Using CT Images
Baihua Zhang ... Fan Yang
IEEE Access | VOL. 7
01 Jan 2019

From Text Analysis to Natural Language Modeling: A Comprehensive Study
Микола Стахів ... Стапан Скопівський
Herald of Khmelnytskyi National University. Technical sciences | VOL. 333
25 Apr 2024

A New Term Frequency with Gaussian Technique for Text Classification and Sentiment Analysis
Vuttichai Vichianchai ... Sumonta Kasemvilas
Journal of ICT Research and Applications | VOL. 15
07 Oct 2021

Applications of Machine Learning Techniques to Predict Diagnostic Breast Cancer
Vikas Chaurasia ... Saurabh Pal
SN Computer Science | VOL. 1
14 Aug 2020
