Abstract

Inverse gravity moment (IGM) is a recent term weighting scheme in the text classification literature. The idea is that a distinguishing term should concentrate in one or a limited number of classes. IGM considers the document frequencies of a term over all classes. However, it cannot handle the class imbalance problem. The natural distribution of documents in text classification is frequently imbalanced. Classifiers generally tend to be biased toward majority classes, that is, classes with many samples. Therefore, documents from minority classes might be ignored. In this study, we tackle the class imbalance problem in IGM and propose to use a factor called relative imbalance ratio (RIR). The aim of the RIR coefficient is to scale the document frequencies of terms from minority classes in order to amplify the IGM score of those terms. Otherwise, those terms might be dwarfed because majority classes have many more documents. Experimental results on three data sets, two of which are imbalanced, show that our proposed method outperforms the original IGM method as well as the improved IGM (IIGM) and seven other state-of-the-art term weighting schemes (TF-ICF, TF-ICSDF, TF-RF, TF-PROB, TF-MONO, RE, AFE-MERT) in terms of f1-macro results while not compromising f1-micro.
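To make the scaling idea concrete, the following is a minimal Python sketch, not the method of the paper. The function `igm` follows the standard IGM definition (the largest per-class document frequency divided by the rank-weighted sum of the frequencies sorted in descending order). The abstract does not define RIR, so `scaled_igm` uses a hypothetical stand-in factor, the size of the largest class divided by the size of each class, purely to illustrate how scaling minority-class document frequencies can raise a term's IGM score. The function names and the example counts are assumptions.

```python
def igm(class_doc_freqs):
    """Standard IGM: f_1 / sum_r(f_r * r), with frequencies ranked in descending order."""
    ranked = sorted(class_doc_freqs, reverse=True)
    moment = sum(f * r for r, f in enumerate(ranked, start=1))
    return ranked[0] / moment if moment else 0.0


def scaled_igm(class_doc_freqs, class_sizes):
    """IGM after scaling each class's document frequency.

    ASSUMPTION: the factor (largest class size / class size) is a
    hypothetical stand-in for RIR, which the abstract does not define.
    """
    largest = max(class_sizes)
    scaled = [f * largest / n for f, n in zip(class_doc_freqs, class_sizes)]
    return igm(scaled)


# Hypothetical term: it appears in 50 of 1000 majority-class documents
# and in 40 of 100 documents of one minority class.
doc_freqs = [50, 40, 0]
class_sizes = [1000, 100, 100]

plain = igm(doc_freqs)                      # 50 / (50*1 + 40*2 + 0*3)
scaled = scaled_igm(doc_freqs, class_sizes)  # 400 / (400*1 + 50*2 + 0*3)
print(plain, scaled)
```

In this made-up example the raw counts make the term look spread over two classes, whereas after scaling it is clearly concentrated in the minority class and its score rises accordingly.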
