Abstract

With the rapid development of internet technology, a large amount of internet text data has become available. Text classification (TC) plays an important role in processing massive text data, but classification accuracy is directly affected by the performance of term weighting. Because term frequency-inverse document frequency (TF-IDF) was originally designed for information retrieval (IR), it is not effective enough for TC, especially for text data with unbalanced distributions such as internet media reports. Therefore, the variance between the document frequency (DF) value of a particular term and the average of all DFs (DF¯), namely, the document frequency variance (ADF), is proposed to improve the handling of text data with unbalanced distributions. The standard TF-IDF is then modified with the proposed ADF in four different ways for processing unbalanced text collections, namely, TF-IADF, TF-IADF+, TF-IADFnorm, and TF-IADF+norm. As a result, an effective model can be established for the TC task on internet media reports. A series of simulations was carried out to evaluate the proposed methods. The simulation results, which compare the proposed methods with TF-IDF on state-of-the-art classification algorithms, confirm their effectiveness and feasibility.
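
As a rough illustration of the quantity the abstract describes, the Python sketch below computes each term's DF, the mean DF over the vocabulary, and the squared deviation of each DF from that mean. The helper names (`document_frequencies`, `df_deviation`), the toy corpus, and the choice of a squared deviation are assumptions made for illustration only. The abstract does not state the exact ADF formula or how it enters TF-IADF, so this should not be read as the authors' definition.

```python
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # set() ensures a term is counted once per document
        df.update(set(doc.split()))
    return dict(df)

def df_deviation(docs):
    """Hypothetical helper: return (mean DF, {term: (DF - mean DF)^2}).

    Assumption: the 'variance between a DF value and the average DF'
    is read here as a per-term squared deviation.
    """
    df = document_frequencies(docs)
    mean_df = sum(df.values()) / len(df)
    deviation = {term: (value - mean_df) ** 2 for term, value in df.items()}
    return mean_df, deviation

# Toy, deliberately unbalanced corpus (not from the paper)
docs = [
    "stock market falls",
    "stock market rises",
    "football match tonight",
    "stock prices",
]
mean_df, dev = df_deviation(docs)
for term in sorted(dev, key=dev.get, reverse=True):
    print(f"{term}: {dev[term]:.4f}")
```

In this toy corpus the term that appears in most documents deviates most from the mean DF, which is the kind of signal a DF-based global factor could exploit on unbalanced collections.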

Highlights

  • Due to the rapid development of internet technology and information infrastructure construction, the volume of text data which can be obtained online has increased dramatically

  • Zhong Tang et al. described two deficiencies of term frequency-inverse document frequency (TF-IDF), namely, that the collection frequency factor is undefined or equal to zero in some special cases. They proposed a novel method, term frequency-inverse exponential frequency (TF-IEF), to overcome these drawbacks [14]. The proposed method replaces the inverse document frequency (IDF) with a global weighting factor, IEF, and uses a log-like function to characterize the collection frequency factor. This greatly reduces the influence of terms with high term frequency (TF) values, which helps generate a more representative vector of terms. The experiments showed that the novel methods performed better than the compared schemes. The knowledge about Chinese language and culture provided by Baidu Baike is learned and organized by Chinese-speaking people and professional employees of Baidu. Therefore, Baidu Baike has been used several times to optimize text classification (TC) on Chinese text [23, 29]

  • Support vector machine (SVM), naïve Bayes (NB), and relevance frequency (RF) classifiers are used to compare the proposed term weighting schemes (TWSs) with TF-IDF. Overall, the proposed TF-IADF outperformed all other methods with the SVM and RF classifiers, as shown in Figure 8. The detailed experimental results in Table 5 show that the proposed TF-IADF+norm performs better than TF-IDF in all cases
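
The two TF-IDF deficiencies noted in the highlights (a collection frequency factor that is undefined or equal to zero) can be reproduced with the textbook definition idf(t) = log(N / df(t)). The sketch below assumes this common form; the exact variant used in the cited work is not given here, and `idf` is a hypothetical helper written for illustration.

```python
import math

def idf(n_docs, df):
    """Textbook IDF, log(N / df).

    Returns None when df == 0, because N / df is undefined there.
    """
    if df == 0:
        return None  # term absent from the collection: factor undefined
    return math.log(n_docs / df)

n_docs = 4
print(idf(n_docs, 4))  # term occurs in every document: log(1) is zero
print(idf(n_docs, 1))  # rare term: positive weight
print(idf(n_docs, 0))  # unseen term: undefined, reported as None
```

A term that occurs in every document therefore receives zero weight regardless of its TF, which is one reason replacement global factors such as IEF or ADF are explored.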

Summary

Introduction

Due to the rapid development of internet technology and information infrastructure construction, the volume of text data that can be obtained online has increased dramatically. Under the standard IGM method, terms with different distinguishing abilities can obtain the same weights, which is unreasonable [28]. In that study, two novel TWSs derived from IGM, namely, SQRT_TF-IGMimp and TF-IGMimp, were proposed to overcome its limitations. The TF-IEF method [14] replaces the IDF with a global weighting factor, IEF, and uses a log-like function to characterize the collection frequency factor. This greatly reduces the influence of terms with high TF values, which helps generate a more representative vector of terms. Baidu Baike has also been used several times to optimize TC on Chinese text [23, 29]. Both Baidu Baike-based methods rely on semantic analysis and require a large amount of computation.
