Feature selection effectively reduces the dimensionality of data. For feature selection, rough set theory offers a systematic theoretical framework based on consistency measures, among which information entropy is one of the most important attribute-significance measures. However, an information-entropy-based significance measure is computationally expensive and must be evaluated repeatedly. Although many acceleration strategies have been proposed, a bottleneck remains when information-entropy-based feature selection algorithms are applied to large-scale, high-dimensional datasets. In this study, we introduce a classified nested equivalence class (CNEC)-based approach to computing information-entropy-based significance for feature selection with rough set theory. The proposed method extracts knowledge of the reducts of a decision table to reduce the universe and construct CNECs. By exploiting the properties of the different types of CNECs, we can not only accelerate both outer and inner significance calculations by discarding useless CNECs but also substantially reduce the number of inner significance calculations by using one type of CNEC. The use of CNECs is shown to significantly speed up three representative entropy-based feature selection algorithms built on rough set theory. Moreover, the feature subset selected by the CNEC-based algorithms is identical to that selected by algorithms using the original definitions of the information entropies. Experiments on 31 datasets from multiple sources, including the UCI repository and the KDD Cup competition, among them large-scale and high-dimensional datasets, confirm the efficiency and effectiveness of the proposed method.
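As a point of reference, the sketch below illustrates the baseline Shannon-conditional-entropy significance measure that the CNEC construction is designed to accelerate. The abstract gives no code, so the decision-table representation (rows as dictionaries), the attribute names, and the function names are illustrative assumptions rather than the authors' implementation; this is a minimal sketch of the standard measure, not the CNEC algorithm itself.

```python
from collections import defaultdict
from math import log2

def partition(rows, attrs):
    """Group row indices into equivalence classes U/B by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def conditional_entropy(rows, cond_attrs, dec_attr):
    """Shannon conditional entropy H(D | B) computed over the partition U/B."""
    n = len(rows)
    h = 0.0
    for block in partition(rows, cond_attrs):
        counts = defaultdict(int)
        for i in block:
            counts[rows[i][dec_attr]] += 1
        for c in counts.values():
            p = c / len(block)
            h -= (len(block) / n) * p * log2(p)
    return h

def significance(rows, a, selected, dec_attr):
    """Significance of candidate attribute a given the selected subset B:
    sig(a, B, D) = H(D | B) - H(D | B ∪ {a})."""
    return (conditional_entropy(rows, selected, dec_attr)
            - conditional_entropy(rows, selected + [a], dec_attr))

# Hypothetical toy decision table: a1, a2 are condition attributes, d the decision.
rows = [
    {"a1": 0, "a2": 1, "d": "yes"},
    {"a1": 0, "a2": 0, "d": "no"},
    {"a1": 1, "a2": 1, "d": "yes"},
    {"a1": 1, "a2": 0, "d": "yes"},
]
print(significance(rows, "a2", [], "d"))  # entropy drop from adding a2 to an empty subset
```

A greedy forward-selection algorithm evaluates this significance for every remaining attribute at every step, which is what makes the measure expensive on large universes; the CNEC approach described in the abstract reduces the universe over which these partitions and entropies are computed.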