An improved supervised term weighting scheme for text representation and classification

Zhong Tang, Wenqiang Li, Yan Li

doi:10.1016/j.eswa.2021.115985

Abstract

Term weighting schemes have a significant effect on text classification performance, because the term weighting scheme determines how texts are represented in the vector space model. Currently, term frequency-inverse document frequency (TF-IDF) is the most widely used term weighting scheme, but it does not use the category information available in the training texts. Taking this category information (or category factor) into account, this study develops an improved supervised term weighting method for text representation, which combines a new information measure, namely cumulative residual entropy, with the proportional distortion function. To verify the classification performance of the proposed scheme, we conducted an extensive experimental comparison with existing schemes on two corpora with different characteristics (the Reuters-21578 and 20 Newsgroups datasets). The results show that the proposed scheme achieves significantly better text classification performance than the other schemes. Specifically, with a linear support vector machine classifier, micro-F1 improved to 0.972 on the Reuters-21578 dataset and 0.833 on the 20 Newsgroups dataset.

Full Text