An Improved Term Weighting Method for Content Analysis on Chinese Internet Media Contents

Zhi-Ying Jiang, Xingjian Tian, Yan-Lin He, Bo Gao, Qun-Xiong Zhu

doi:10.1109/codit49905.2020.9263808

Abstract

Term frequency-inverse document frequency (TF-IDF) is a term weighting method that is widely used for textual content analysis. However, when TF-IDF is used to classify Chinese Internet media content whose distribution is unbalanced, it is unreasonable for IDF to equate a high inverse document frequency with strong distinguishing ability. In this paper, the formula for the collection frequency factor is optimized by combining it with the average document frequency $(\overline{DF})$. This combination helps to reduce the influence of the unbalanced distribution of content on the Internet. A novel formula based on $\overline{DF}$ is designed as the collection frequency factor and combined with term frequency (TF) to form a new term weighting method named TF-ISDF. Experiments with support vector machine (SVM) and random forest (RF) classifiers are conducted to compare the two term weighting schemes (TWS). The results show that the proposed TF-ISDF approach outperforms TF-IDF with both classification algorithms in the analysis of Chinese Internet media content.
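For orientation, the quantities named in the abstract can be sketched in a few lines of Python. This is a minimal illustration of the baseline only: it computes the textbook TF-IDF weight, TF × log(N / DF), together with the average document frequency $\overline{DF}$. The abstract does not state the TF-ISDF formula, so it is not reproduced here. The function name `tf_idf`, the length-normalized TF, the natural logarithm, and the toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Baseline TF-IDF over a non-empty list of tokenized documents.

    Returns (weights, df, mean_df), where weights[i][t] is the TF-IDF
    weight of term t in document i, df[t] is the number of documents
    containing t, and mean_df is the average of df over the vocabulary.
    """
    n_docs = len(docs)

    # Document frequency: count each term once per document.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    # Average document frequency over all distinct terms.
    mean_df = sum(df.values()) / len(df)

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        # TF is normalized by document length (an assumption of this sketch).
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights, df, mean_df


# Hypothetical toy corpus of pre-segmented Chinese text.
docs = [
    ["股票", "市场", "上涨"],
    ["股票", "基金", "市场"],
    ["足球", "比赛", "市场"],
]
weights, df, mean_df = tf_idf(docs)
```

In this toy corpus the term 市场 appears in every document, so its IDF, and therefore its TF-IDF weight, is zero, while a term that appears in only one document receives the largest weight regardless of how the documents are spread across classes.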

Full Text