Abstract

Selecting a small subset of the thousands of genes in microarray data is important for accurate classification of phenotypes. Widely used methods typically rank genes according to their differential expression among phenotypes and pick the top-ranked genes. While microarrays can measure the expression levels of thousands of genes per sample, case-control microarray studies usually involve no more than several dozen samples. Standard classifiers do not work well in these situations, where the number of features (gene expression levels measured in these microarrays) far exceeds the number of samples. Selecting only the features that are most relevant for discriminating between the two categories can help construct better classifiers, in terms of both accuracy and efficiency. We observe that feature sets obtained this way contain a certain amount of redundancy, and we study methods to minimize it. We propose the minimum redundancy and maximum relevance feature selection framework. In this paper we apply two general approaches to feature subset selection, namely the wrapper and filter approaches, and then create a new hybrid model that combines the characteristics of the two for gene selection. We also compare the gene selection performance of the filter, wrapper and hybrid models. The approach leads to significantly improved class predictions in extensive experiments on four gene expression data sets: CNS, Leukemia, Lung and Brain Tumor. Improvements are observed consistently across three neural network classification methods: Learning Vector Quantization (LVQ), Self-Organizing Map (SOM) and Back Propagation (BP). Selecting efficient feature extraction techniques and predictive models provides high classification accuracy for microarray datasets. The investigations show that a small number of genes, compared to the total number available, have expression levels strongly correlated with certain phenotypes.
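The minimum redundancy and maximum relevance idea mentioned above can be illustrated with a minimal sketch. It is not the implementation used in the paper: it assumes a greedy search with the mutual-information-difference criterion (relevance to the class label minus mean redundancy with already selected genes), and a three-state discretization of expression values around the mean. The function names `discretize`, `mutual_information` and `mrmr_select` are hypothetical.

```python
from collections import Counter
from math import log
from statistics import mean, pstdev


def discretize(values):
    """Map expression values to -1/0/1 around mean +/- half a standard
    deviation (an assumed discretization, not taken from the paper)."""
    mu, sd = mean(values), pstdev(values)
    return [1 if v > mu + sd / 2 else -1 if v < mu - sd / 2 else 0
            for v in values]


def mutual_information(x, y):
    """Empirical mutual information (in nats) of two discrete sequences."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    # p(a,b) * log(p(a,b) / (p(a) p(b))), written with raw counts
    return sum((c / n) * log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def mrmr_select(columns, labels, k):
    """Greedily pick k gene indices: high relevance to the labels,
    low mean redundancy with the genes picked so far."""
    relevance = [mutual_information(col, labels) for col in columns]
    selected = [max(range(len(columns)), key=relevance.__getitem__)]
    redundancy = [0.0] * len(columns)  # running sum of MI with selected genes
    while len(selected) < min(k, len(columns)):
        last = columns[selected[-1]]
        best, best_score = None, float("-inf")
        for j, col in enumerate(columns):
            if j in selected:
                continue
            redundancy[j] += mutual_information(col, last)
            score = relevance[j] - redundancy[j] / len(selected)
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

Here `columns` is a list with one discretized expression vector per gene and `labels` holds the phenotype of each sample. A plain top-ranked filter would use only `relevance`; the redundancy term is what keeps a duplicate of an already selected gene from being picked again.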
Consequently, selecting differentially expressed, relevant predictor genes makes it possible to analyse gene expression profiles correctly and plays a crucial role in the classification process. The subset of potential genes identified by a feature selection technique correctly distinguishes the sample classes. Therefore, a good method for selecting genes relevant for sample classification, based on the number of genes investigated, is needed to increase predictive accuracy and to avoid incomprehensibility. One application of gene expression data analysis is cancer classification. Usually, expression levels of the
