A refinement approach to handling model misfit in text categorization

Haoran Wu, Tong Heng Phang, Bing Liu, Xiaoli Li

doi:10.1145/775047.775078

Abstract

Text categorization, or classification, is the automated assignment of text documents to predefined classes based on their contents. The problem has been studied in information retrieval, machine learning, and data mining, and many effective techniques have been proposed. Most of these techniques, however, rest on an underlying model and/or set of assumptions. When the data fits the model well, classification accuracy is high; when it does not, accuracy can be very low. In this paper, we propose a refinement approach to dealing with this problem of model misfit. We show that the classification technique itself (or its underlying model) does not need to be changed to make it more flexible. Instead, we use successive refinements of classification on the training data to correct the model misfit. We apply the proposed technique to improve the classification performance of two simple and efficient text classifiers, the Rocchio classifier and the naive Bayesian classifier. These two classifiers are suitable for very large text collections because they allow the data to reside on disk and need only one scan of the data to build a text classifier. Extensive experiments on two benchmark document corpora show that the proposed technique improves the text categorization accuracy of both classifiers dramatically. In particular, our refined model improves the prediction performance of the naive Bayesian or Rocchio classifier by 45% on average.
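To make the "one scan of the data" property concrete, here is a minimal sketch of a centroid (prototype) classifier in the spirit of Rocchio. It is an illustration under stated assumptions, not the authors' implementation and not their refinement procedure: it uses raw term counts rather than TF-IDF weights, omits the negative-example term of the full Rocchio formula, and the names `build_centroids` and `classify` and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_centroids(labeled_docs):
    """Accumulate per-class term counts in a single pass over (label, tokens) pairs."""
    sums = defaultdict(Counter)
    for label, tokens in labeled_docs:  # one scan; documents could be streamed from disk
        sums[label].update(tokens)
    # Length-normalize each class vector so cosine similarity reduces to a dot product.
    centroids = {}
    for label, vec in sums.items():
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        centroids[label] = {term: v / norm for term, v in vec.items()}
    return centroids


def classify(centroids, tokens):
    """Return the class whose centroid is most cosine-similar to the document."""
    doc = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in doc.values())) or 1.0

    def score(label):
        centroid = centroids[label]
        return sum(centroid.get(term, 0.0) * v for term, v in doc.items()) / norm

    return max(centroids, key=score)


# Toy data, assumed purely for illustration.
train = [
    ("sports", ["goal", "match", "team"]),
    ("sports", ["team", "coach", "match"]),
    ("finance", ["stock", "market", "bank"]),
    ("finance", ["bank", "loan", "market"]),
]
centroids = build_centroids(train)
prediction = classify(centroids, ["team", "match", "win"])
```

Because the model is just one summed vector per class, it can be built while streaming documents from disk. That same simplicity is what makes it prone to misfit when a class is not well represented by a single prototype.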
