Abstract

Term frequency-inverse document frequency (TF-IDF) is a term weighting method that is widely used for textual content analysis. However, when TF-IDF is used to classify Chinese Internet media content whose distribution is unbalanced, it is unreasonable for IDF to treat a high inverse document frequency as equivalent to strong distinguishing ability. In this paper, the formula for calculating the collection frequency factor is optimized by combining it with the average document frequency $(\overline {DF} )$. This combination helps to reduce the influence of the unbalanced distribution of content on the Internet. A novel formula based on $(\overline {DF} )$ is designed as the collection frequency factor and combined with term frequency (TF) to form a new term weighting method named TF-ISDF. Experiments with support vector machine (SVM) and Random Forests (RF) classifiers are conducted to compare the two term weighting schemes (TWS). The results show that the proposed TF-ISDF approach performs better than TF-IDF with both classification algorithms in the analysis of Chinese Internet media content.
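For orientation, the two quantities named in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's method: it computes classic TF-IDF with the common $\log(N/DF)$ form and the average document frequency $\overline{DF}$ over a toy corpus. The TF-ISDF formula itself is not stated in the abstract, so it is deliberately not reproduced here. The function names, the raw-count TF, the unsmoothed logarithm, and the sample tokens are all assumptions made for illustration.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """For each term, count the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): each document counts a term once
    return df


def tf_idf(docs):
    """Classic TF-IDF: raw term count times log(N / DF), no smoothing (assumed variant)."""
    n_docs = len(docs)
    df = document_frequencies(docs)
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def mean_document_frequency(docs):
    """Average DF over the vocabulary, i.e. the DF-bar quantity the abstract mentions."""
    df = document_frequencies(docs)
    return sum(df.values()) / len(df)


# Toy, pre-tokenized corpus (hypothetical data; real Chinese text would
# first need word segmentation).
docs = [
    ["market", "stock", "market"],
    ["stock", "bank"],
    ["football", "match"],
]

df = document_frequencies(docs)
weights = tf_idf(docs)
df_bar = mean_document_frequency(docs)
```

In this sketch a term's IDF depends only on how many documents contain it, regardless of how those documents are spread across categories, which is the behavior the abstract criticizes for unbalanced collections.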
