Abstract
Background: Dimension reduction is a critical issue in the analysis of microarray data, because the high dimensionality of gene expression microarray data sets hurts the generalization performance of classifiers. Dimension reduction methods fall into two types: feature selection and feature extraction. Principal component analysis (PCA) and partial least squares (PLS) are two frequently used feature extraction methods. In previous work, the top several components of PCA or PLS are selected for modeling according to the descending order of eigenvalues. In this paper, we show that not all of the top features are useful, and that features should instead be selected from all the components by feature selection methods.

Results: We demonstrate a framework for selecting feature subsets from all the newly extracted components, leading to reduced classification error rates on gene expression microarray data. We consider both an unsupervised method (PCA) and a supervised method (PLS) for extracting new components, genetic algorithms for feature selection, and support vector machines and k-nearest neighbor for classification. Experimental results illustrate that the proposed framework is effective at selecting feature subsets and reducing classification error rates.

Conclusion: The top features newly extracted by PCA or PLS are not the only important ones; therefore, feature selection should be performed on the new features to select subsets that improve the generalization performance of classifiers.
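The framework described in the abstract (extract all components, then let a genetic algorithm choose a subset) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the helper names `pca_components`, `loo_1nn_accuracy`, and `ga_select`, the GA settings (population size, generations, mutation rate, single-point crossover, truncation selection), the use of leave-one-out 1-nearest-neighbor accuracy as the fitness function, and the synthetic data. It also assumes at least two components are available.

```python
import numpy as np

def pca_components(X):
    """Project X (n samples x p genes) onto ALL principal components."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U * s  # scores; columns ordered by descending singular value

def loo_1nn_accuracy(Z, y):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a sample may not be its own neighbor
    return float(np.mean(y[d.argmin(axis=1)] == y))

def ga_select(Z, y, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Toy GA over boolean masks of components (illustrative settings)."""
    rng = np.random.default_rng(seed)
    n_feat = Z.shape[1]
    pop = rng.random((pop_size, n_feat)) < 0.5

    def fitness(mask):
        return loo_1nn_accuracy(Z[:, mask], y) if mask.any() else 0.0

    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[: pop_size // 2]]  # keep the better half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_feat)               # single-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n_feat) < p_mut         # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents] + children)

    scores = np.array([fitness(m) for m in pop])
    return pop[scores.argmax()], float(scores.max())

# Synthetic stand-in for a microarray matrix (hypothetical data).
rng = np.random.default_rng(1)
n, p = 30, 50
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[:, 0] *= 10.0      # high-variance gene unrelated to the class label
X[:, 1] += 3.0 * y   # lower-variance gene that carries the class signal

Z = pca_components(X)
best, acc = ga_select(Z, y)
print("selected component indices:", np.flatnonzero(best))
print("leave-one-out 1-NN accuracy:", acc)
```

The synthetic data is built so that the largest-variance direction is unrelated to the class label, which is the situation the abstract argues against handling by simply keeping the top components. A supervised extractor such as PLS, or an SVM-based fitness, could be substituted without changing the structure of the loop.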
Highlights
Dimension reduction is a critical issue in the analysis of microarray data, because the high dimensionality of gene expression microarray data sets hurts the generalization performance of classifiers
Results using SVM: To demonstrate the importance of feature selection in dimension reduction, we performed a series of experiments using a support vector machine (SVM) as the classifier. Feature extraction projects the whole data set into a low-dimensional space and constructs the new dimensions by analyzing the statistical relationships hidden in the data set
The y-axis shows the average number of times that the eigenvectors within a given bin are selected by the genetic algorithm (GA)
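The per-bin average described for the y-axis can be computed as in the following sketch. The function name `mean_selection_per_bin`, the bin size, and the example masks are hypothetical; the source does not state how many GA runs were used or how wide the bins are.

```python
import numpy as np

def mean_selection_per_bin(masks, bin_size):
    """Average selection count of the eigenvectors inside each bin.

    masks: one 0/1 row per GA run, one column per eigenvector
           (columns ordered by descending eigenvalue).
    """
    masks = np.asarray(masks, dtype=int)
    counts = masks.sum(axis=0)  # times each eigenvector was selected
    n_bins = int(np.ceil(counts.size / bin_size))
    return np.array([
        counts[i * bin_size:(i + 1) * bin_size].mean()
        for i in range(n_bins)
    ])

# Three hypothetical GA runs over six eigenvectors.
masks = [
    [1, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 1],
]
bin_means = mean_selection_per_bin(masks, bin_size=2)
print(bin_means)  # → [2.  0.5 2. ]
```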
Summary
Dimension reduction is a critical issue in the analysis of microarray data, because the high dimensionality of gene expression microarray data sets hurts the generalization performance of classifiers. Dimension reduction methods fall into two types: feature selection and feature extraction. Tumor classification is performed on microarray data col- cell samples [1,2,3]. The wealth of this kind of data in dif-. The output of a microarray experiment is summarized as an n × p data matrix, where n is the number of tissue or cell samples and p is the number of genes. Dimension reduction methods are applied to reduce the dimensionality from p to q with q < p.
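As a concrete illustration of reducing an n × p matrix to q dimensions, the sketch below applies the conventional top-q PCA projection using NumPy's SVD. The sizes `n`, `p`, `q` and the random matrix are placeholders chosen for the example, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 20, 500, 5            # placeholder sizes: few samples, many genes
X = rng.normal(size=(n, p))     # stand-in for an n x p expression matrix

Xc = X - X.mean(axis=0)         # center each gene (column)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal directions, ordered by descending
# singular value; keeping the first q gives the usual top-q projection.
Z = Xc @ Vt[:q].T

print(X.shape, "->", Z.shape)   # → (20, 500) -> (20, 5)
```

Keeping the first q rows of `Vt` is the baseline practice of ranking components by eigenvalue; choosing a different subset of rows gives the alternative of selecting components from the full set.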