A novel weighting formula and feature selection for text classification based on rough set theory

Qinghua Hu, Wen Bao, Yanfeng Duan, D Yu

doi:10.1109/nlpke.2003.1275985

Abstract

Weighting formulas and feature selection are key preprocessing steps in text classification and mining. We analyze the drawbacks of weighting formulas based on inverse document frequency (IDF) and present a novel feature weighting and selection method based on the variable precision rough set model. IDF does not take class information into account, and a criterion based on IDF is not monotonic with the contribution a feature makes to classification, which degrades classifier performance. The measure of classification quality from the variable precision rough set model can handle complex classification problems and measures the contribution a feature makes to classification. We introduce it as a criterion for feature selection and weighting in text classification and name it TFACQ. Experimental results show that the weighting formula and feature selection based on TFACQ greatly improve classification performance.
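The contrast between IDF and a rough-set classification quality measure can be illustrated with a small sketch. This is not the TFACQ formula, which the abstract does not give; it assumes the standard variable precision rough set definition of classification quality (the fraction of documents that fall in equivalence classes whose majority class share reaches a threshold β) and treats each term as a binary present/absent attribute. The toy corpus, the function names `idf` and `classification_quality`, and the value β = 0.8 are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is (set of terms, class label).
docs = [
    ({"goal", "match"}, "sport"),
    ({"goal", "report"}, "sport"),
    ({"team", "match"}, "sport"),
    ({"market", "report"}, "finance"),
    ({"market", "stock"}, "finance"),
    ({"bank", "stock"}, "finance"),
]


def idf(term, docs):
    """Plain inverse document frequency: log(N / df). Ignores class labels."""
    df = sum(1 for terms, _ in docs if term in terms)
    return math.log(len(docs) / df)


def classification_quality(term, docs, beta=0.8):
    """Assumed VPRS-style quality for one binary attribute (term present or absent).

    The attribute splits the corpus into two equivalence classes. A class counts
    toward the beta-positive region when its most frequent label covers at least
    a fraction beta of its documents. The quality is the share of documents that
    lie in the beta-positive region.
    """
    blocks = {True: [], False: []}
    for terms, label in docs:
        blocks[term in terms].append(label)

    positive = 0
    for labels in blocks.values():
        if not labels:
            continue
        majority_count = Counter(labels).most_common(1)[0][1]
        if majority_count / len(labels) >= beta:
            positive += len(labels)
    return positive / len(docs)


for term in ("goal", "report"):
    print(term, round(idf(term, docs), 3), round(classification_quality(term, docs), 3))
```

In this corpus "goal" and "report" each occur in two documents, so IDF assigns them the same weight, although "goal" occurs only in the sport class while "report" is split evenly between the two classes. The class-aware measure separates them, which is the kind of behavior the abstract attributes to a criterion based on classification quality.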
