The bag-of-words technique is often used to represent a document in text categorization. However, for a large set of documents, where the dimension of the bag-of-words vector is very high, text categorization becomes a serious challenge as a result of sparse data, over-fitting, and irrelevant features. A filter feature selection method reduces the number of features by eliminating irrelevant features from the bag-of-words vector. In this paper, we analyze the strengths and weaknesses of two filter feature selection approaches: the frequency-based approach and the cluster-based approach. Based on this analysis, we propose hybrid filter feature selection methods, named the Frequency-Cluster Feature Selection (FCFS) and the Detailed Frequency-Cluster Feature Selection (DtFCFS), to further improve the performance of the filter feature selection process in text categorization. The FCFS is a combination of the frequency-based approach and the cluster-based approach, while the DtFCFS, a detailed version of the FCFS, is a comprehensively hybrid cluster-based method. We conduct experiments on four benchmark datasets (the Reuters-21578 and Newsgroup datasets for news classification, the Ohsumed dataset for medical document classification, and the LingSpam dataset for email classification) to compare the proposed methods with six well-known related methods, namely the Comprehensive Measurement Feature Selection (CMFS), the Optimal Orthogonal Centroid Feature Selection (OCFS), the Crossed Centroid Feature Selection (CIIC), the Information Gain (IG), the Chi-square (CHI), and the Deviation from Poisson Feature Selection (DFPFS). In terms of the Micro-F1, the Macro-F1, and the dimension reduction rate, the DtFCFS is superior to the other methods, while the FCFS performs competitively with, and in some cases better than, the strongest baseline methods, especially for the Macro-F1.
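To make the filter feature selection setting concrete, the sketch below shows a minimal bag-of-words pipeline with a Chi-square (CHI) filter, one of the baseline methods compared in the paper. It is an illustrative example only, not the proposed FCFS or DtFCFS methods; the corpus, labels, and the value of k are hypothetical, and scikit-learn is assumed as the toolkit.

```python
# Minimal sketch: bag-of-words representation followed by a Chi-square (CHI)
# filter that keeps only the k highest-scoring terms before classification.
# The documents, labels, and k below are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = [
    "stocks rallied on earnings news",       # hypothetical news document
    "central bank raises interest rates",    # hypothetical news document
    "new treatment shows promise in trial",  # hypothetical medical document
    "patients respond well to the therapy",  # hypothetical medical document
]
labels = ["news", "news", "medical", "medical"]

pipeline = Pipeline([
    ("bow", CountVectorizer()),           # high-dimensional bag-of-words vector
    ("filter", SelectKBest(chi2, k=5)),   # CHI filter: drop low-scoring terms
    ("clf", MultinomialNB()),             # any downstream text classifier
])
pipeline.fit(docs, labels)
print(pipeline.predict(["doctors report trial results"]))
```

The hybrid methods proposed in the paper would replace the CHI scoring step with their frequency- and cluster-based criteria; the surrounding pipeline (bag-of-words vectorization, feature filtering, classification) stays the same.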