Abstract

Cancer classification and gene selection are among the most important applications of DNA microarray data. Because microarray data are high dimensional, classification and gene selection techniques are frequently used to support expert systems in diagnosing cancer with higher classification accuracy. The least absolute shrinkage and selection operator (LASSO) is one of the most popular methods for cancer classification and gene selection in high dimensional data. However, LASSO is biased and cannot select more variables than the sample size (n) when used for gene selection and classification of high dimensional microarray data. To address these problems, LASSO-C1F was proposed. It uses a scale-invariant measure of the maximal information complexity of the covariance matrix, with weight modifications, as a data-adaptive alternative to the fairly arbitrary choice of the regularization term in LASSO. The results indicate that the proposed LASSO-C1F method is more effective than the classical LASSO: on the evaluation criteria, LASSO-C1F performs better in terms of AUC and the number of genes selected.

Highlights

  • With the recent development of high dimensional microarray data in genetics and molecular biology, the resulting datasets have small sample sizes and high dimension: the sample size is typically in the hundreds, whereas the number of genes is in the tens of thousands [1], [2]. The success of any statistical method on high dimensional data relies on the pre-determination of dissimilar features [3]

  • The classical least absolute shrinkage and selection operator (LASSO) algorithm and the modified algorithm were tested on two microarray data sets, using a training set of 80% of the original samples and a testing set of the remaining 20% over 50 Monte-Carlo cross-validation (MCCV) iterations

  • Standard classifier performance metrics, namely misclassification error rate (MER), correct classification rate (CCR), sensitivity (SEN), specificity (SPEC), positive predictive value (PPV), negative predictive value (NPV), balanced accuracy (BA), G-mean and area under the ROC curve (AUC), were estimated on the 20% test set over 50 Monte-Carlo cross-validation (MCCV) iterations
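The evaluation protocol described in the highlights can be sketched as follows. This is a minimal illustration, assuming scikit-learn and an L1-penalized logistic regression as a stand-in for the classical LASSO classifier; the function name `mccv_lasso`, the fixed penalty strength `C`, the standardization step and the synthetic data are all assumptions made for the example, not details taken from the paper. The LASSO-C1F weighting is not implemented here because its formula is not given in this section.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler


def _ratio(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den > 0 else float("nan")


def mccv_lasso(X, y, n_iter=50, test_size=0.2, C=0.5, seed=0):
    """Average test-set metrics of an L1-penalized logistic classifier
    over repeated random 80/20 splits (Monte-Carlo cross-validation).

    C is a hypothetical fixed penalty strength; the paper's tuning
    procedure is not described in this section.
    """
    splitter = StratifiedShuffleSplit(
        n_splits=n_iter, test_size=test_size, random_state=seed
    )
    rows = []
    for train_idx, test_idx in splitter.split(X, y):
        # Standardize using training-set statistics only.
        scaler = StandardScaler().fit(X[train_idx])
        X_tr = scaler.transform(X[train_idx])
        X_te = scaler.transform(X[test_idx])

        model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        model.fit(X_tr, y[train_idx])

        pred = model.predict(X_te)
        prob = model.predict_proba(X_te)[:, 1]
        tn, fp, fn, tp = confusion_matrix(
            y[test_idx], pred, labels=[0, 1]
        ).ravel()

        sen = _ratio(tp, tp + fn)
        spec = _ratio(tn, tn + fp)
        ccr = (tp + tn) / (tp + tn + fp + fn)
        rows.append({
            "MER": 1.0 - ccr,
            "CCR": ccr,
            "SEN": sen,
            "SPEC": spec,
            "PPV": _ratio(tp, tp + fp),
            "NPV": _ratio(tn, tn + fn),
            "BA": (sen + spec) / 2.0,
            "G-mean": float(np.sqrt(sen * spec)),
            "AUC": roc_auc_score(y[test_idx], prob),
            # Genes with a nonzero coefficient count as selected.
            "genes_selected": int(np.count_nonzero(model.coef_)),
        })
    # Average each metric over the iterations, ignoring undefined values.
    return {k: float(np.nanmean([r[k] for r in rows])) for k in rows[0]}


if __name__ == "__main__":
    # Synthetic stand-in for a microarray data set (few samples, many genes).
    X, y = make_classification(
        n_samples=100, n_features=500, n_informative=10,
        n_redundant=0, random_state=0,
    )
    for name, value in mccv_lasso(X, y).items():
        print(f"{name}: {value:.3f}")
```

The stratified splitter keeps the class proportions of the full data in every 80/20 split, which guarantees that sensitivity and specificity are defined on each test set; whether the paper stratifies its splits is not stated.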


Summary

Introduction

With the recent development of high dimensional microarray data in genetics and molecular biology, the resulting datasets have small sample sizes and high dimension: the sample size is typically in the hundreds, whereas the number of genes is in the tens of thousands [1], [2]. The success of any statistical method on high dimensional data relies on the pre-determination of dissimilar features [3]. The aim of feature selection and dimension reduction is to identify the smallest possible subset that is still the most significant. Various feature selection approaches have been proposed in the literature [3]–[8].

Norazlina Bint Ismail, Department of Mathematics, Faculty of Science, Universiti Teknologi Malaysia, 81310 UTM Skudai, Johor, Malaysia.
