Microarray data are highly redundant and noisy, and most genes are believed to be uninformative with respect to studied classes, as only a fraction of genes may present distinct profiles for different classes of samples. This paper proposed a novel hybrid framework (NHF) for the classification of high dimensional microarray data, which combined information gain(IG), F-score, genetic algorithm(GA), particle swarm optimization(PSO) and support vector machines(SVM). In order to identify a subset of informative genes embedded out of a large dataset which is contaminated with high dimensional noise, the proposed method is divided into three stages. In the first stage, IG is used to construct a ranking list of features, and only 10% features of the ranking list are provided for the second stage. In the second stage, PSO performs the feature selection task combining SVM. F-score is considered as a part of the objective function of PSO. The feature subsets are filtered according to the ranking list from the first stage, and then the results of it are supplied to the initialization of GA. Both the SVM parameter optimization and the feature selection are dynamically executed by PSO. In the third stage, GA initializes the individual of population from the results of the second stage, and an optimal result of feature selection is gained using GA integrating SVM. Both the SVM parameter optimization and the feature selection are dynamically performed by GA. The performance of the proposed method was compared with that of the PSO based, GA based, Ant colony optimization (ACO) based and simulated annealing (SA) based methods on five benchmark data sets, leukemia, colon, breast cancer, lung carcinoma and brain cancer. The numerical results and statistical analysis show that the proposed approach is capable of selecting a subset of predictive genes from a large noisy data set, and can capture the correlated structure in the data. In addition, NHF performs significantly better than the other methods in terms of prediction accuracy with smaller subset of features.
Read full abstract