Abstract

Background: One intractable problem with using microarray data analysis for cancer classification is how to reduce the extremely high-dimensionality gene feature data to remove the effects of noise. Feature selection is often used to address this problem by selecting informative genes from among thousands or tens of thousands of genes. However, most of the existing methods of microarray-based cancer classification utilize too many genes to achieve accurate classification, which often hampers the interpretability of the models. For a better understanding of the classification results, it is desirable to develop simpler rule-based models with as few marker genes as possible.

Methods: We screened a small number of informative single genes and gene pairs on the basis of their depended degrees proposed in rough sets. Applying the decision rules induced by the selected genes or gene pairs, we constructed cancer classifiers. We tested the efficacy of the classifiers by leave-one-out cross-validation (LOOCV) of training sets and classification of independent test sets.

Results: We applied our methods to five cancerous gene expression datasets: leukemia (acute lymphoblastic leukemia [ALL] vs. acute myeloid leukemia [AML]), lung cancer, prostate cancer, breast cancer, and leukemia (ALL vs. mixed-lineage leukemia [MLL] vs. AML). Accurate classification outcomes were obtained by utilizing just one or two genes. Some genes that correlated closely with the pathogenesis of relevant cancers were identified. In terms of both classification performance and algorithm simplicity, our approach outperformed or at least matched existing methods.

Conclusion: In cancerous gene expression datasets, a small number of genes, even one or two if selected correctly, is capable of achieving an ideal cancer classification effect. This finding also means that very simple rules may perform well for cancerous class prediction.
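The gene screening step can be illustrated with a short sketch. This excerpt does not give the authors' exact definition of the depended degree or their discretization scheme, so the code below assumes the standard rough-set dependency degree, the fraction of samples whose attribute values determine their class unambiguously, applied to already-discretized expression values. The function name `dependency_degree`, the gene names `g1` and `g2`, and the toy data are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def dependency_degree(
    samples: Sequence[Dict[str, str]],
    labels: Sequence[str],
    genes: Sequence[str],
) -> float:
    """Standard rough-set dependency degree of the class on `genes`.

    Samples with identical values on `genes` form an equivalence block.
    A block belongs to the positive region when all its samples share
    one class. The degree is |positive region| / |all samples|.
    """
    blocks: Dict[tuple, List[str]] = defaultdict(list)
    for row, label in zip(samples, labels):
        key = tuple(row[g] for g in genes)
        blocks[key].append(label)

    positive = sum(
        len(block_labels)
        for block_labels in blocks.values()
        if len(set(block_labels)) == 1  # block is class-pure
    )
    return positive / len(samples)


# Made-up, already-discretized toy data (not from the paper).
toy_samples = [
    {"g1": "low", "g2": "low"},
    {"g1": "low", "g2": "high"},
    {"g1": "high", "g2": "low"},
    {"g1": "high", "g2": "high"},
]
toy_labels = ["ALL", "ALL", "AML", "AML"]

print(dependency_degree(toy_samples, toy_labels, ["g1"]))        # single gene
print(dependency_degree(toy_samples, toy_labels, ["g2"]))        # single gene
print(dependency_degree(toy_samples, toy_labels, ["g1", "g2"]))  # gene pair
```

In this toy example `g1` alone determines the class, while `g2` alone determines nothing, so a screening step that ranks single genes and gene pairs by this score would keep `g1`.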

Highlights

  • One intractable problem with using microarray data analysis for cancer classification is how to reduce the extremely high-dimensionality gene feature data to remove the effects of noise

  • The decision rules induced by gene #4847 in every leave-one-out training set are of the following form: if g(#4847) > t, Acute Myeloid Leukemia (AML); if g(#4847) ≤ t, Acute Lymphoblastic Leukemia (ALL), where t is equal to or close to 994

  • The final leave-one-out cross-validation (LOOCV) accuracy resulting from the gene was 97.4%, with 37 of the 38 samples classified correctly, wherein all of the 27 ALL samples were classified correctly, and one AML sample was misclassified
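The single-gene rule and the LOOCV procedure described in the highlights above can be sketched as follows. The rule direction (above the threshold means AML) and the value 994 come from the text; how the authors derive the threshold in each fold is not stated here, so `pick_threshold` is a hypothetical stand-in that picks the midpoint cut with the fewest training errors. The expression values are invented for illustration.

```python
from typing import List, Sequence


def classify_by_threshold(expression: float, t: float) -> str:
    """Rule form from the text: g > t -> AML, g <= t -> ALL."""
    return "AML" if expression > t else "ALL"


def pick_threshold(values: Sequence[float], labels: Sequence[str]) -> float:
    """Hypothetical threshold search (an assumption, not the paper's method).

    Tries the midpoints between adjacent sorted values and returns the
    one that misclassifies the fewest training samples.
    """
    ordered = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]

    def errors(t: float) -> int:
        return sum(
            classify_by_threshold(v, t) != y for v, y in zip(values, labels)
        )

    return min(candidates, key=errors)


def loocv_accuracy(values: List[float], labels: List[str]) -> float:
    """Leave one sample out, refit the threshold, test on the held-out one."""
    correct = 0
    for i in range(len(values)):
        train_values = values[:i] + values[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        t = pick_threshold(train_values, train_labels)
        correct += classify_by_threshold(values[i], t) == labels[i]
    return correct / len(values)


# Threshold reported in the text for gene #4847.
print(classify_by_threshold(1500.0, 994.0))
print(classify_by_threshold(994.0, 994.0))

# Invented expression values, only to exercise the LOOCV loop.
toy_values = [200.0, 350.0, 500.0, 800.0, 1200.0, 1500.0, 2100.0]
toy_labels = ["ALL", "ALL", "ALL", "ALL", "AML", "AML", "AML"]
print(loocv_accuracy(toy_values, toy_labels))

# Arithmetic behind the reported figure: 37 of 38 samples correct.
print(round(37 / 38 * 100, 1))
```

Because the threshold is refit in every fold, it can shift slightly between folds, which is consistent with the text's statement that t is equal to or close to 994 across the leave-one-out training sets.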


Summary

Introduction

One intractable problem with using microarray data analysis to create cancer classifiers is how to reduce the exceedingly high-dimensional gene expression data, which contain a large amount of noise. Most existing methods of microarray-based cancer classification use too many genes to achieve accurate classification, which often hampers the interpretability of the models. Moreover, the number of samples is severely limited compared with the number of gene expression levels measured in each experiment. Together, these properties pose two computational challenges: computational cost and classification accuracy. When accurate classification rests on a large cluster of genes, it is also hard to gauge which gene is essential in determining the cancer class.
