An integrated method for cancer classification and rule extraction from microarray data

Liang-Tsung Huang

doi:10.1186/1423-0127-16-25

Abstract

Microarray techniques have recently been used successfully to obtain information useful for cancer diagnosis at the gene expression level, owing to their ability to measure thousands of gene expression levels in a massively parallel way. One important issue is improving classification performance on microarray data. Ideally, however, influential genes and even interpretable rules would be identified at the same time to offer biological insight. Introducing concepts of system design from software engineering, this paper presents an integrated and effective method (named X-AI) for accurate cancer classification and the acquisition of knowledge from DNA microarray data. The method includes a feature selector that systematically extracts the relatively important genes, reducing the dimension while retaining as much of the class-discriminatory information as possible. Next, diagonal quadratic discriminant analysis (DQDA) is used to classify tumors, and generalized rule induction (GRI) is integrated to establish association rules that give an understanding of the relationships between cancer classes and related genes. Two non-redundant datasets of acute leukemia were used to validate the proposed X-AI, showing significantly high accuracy for discriminating different classes. In addition, I have demonstrated the ability of X-AI to extract relevant genes and to develop interpretable rules. Further, a web server has been established for cancer classification and it is freely available at .

Highlights

The challenge of cancer treatment is to develop specific therapies based on distinct tumor types, to maximize efficacy and minimize toxicity.
Datasets and pre-processing. I used two different leukemia datasets for the following reasons: (i) both datasets have been analyzed and discussed widely in the literature, which makes it possible to compare results; (ii) the rules extracted from datasets of a similar cancer type can be compared with each other; (iii) the robustness of the classification system can be observed on datasets obtained from different experiments; and (iv) the two datasets represent binary and multi-class classification problems, which is useful for evaluating the effectiveness of the proposed method on different classification problems.
The first acute leukemia dataset, from Golub et al. [1], is composed of 72 samples from two different types of acute leukemia: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML).

Summary

Introduction

The challenge of cancer treatment is to develop specific therapies based on distinct tumor types, to maximize efficacy and minimize toxicity. Other methods include hierarchical clustering [2], machine learning [3,4], compound covariate [5], shrunken centroids [6], partial least squares [7], principal component analysis disjoint models [8], factor mixture models [9], and consensus analysis of multiple classifiers using non-repetitive variables [10]. On the whole, these methods concentrate mostly on improving accuracy rather than on other issues.

Methods
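
The abstract names diagonal quadratic discriminant analysis (DQDA) as the classification step of X-AI. As an illustration only, and not the paper's implementation, the following minimal sketch shows the textbook form of DQDA: each class gets its own per-gene mean and variance (a diagonal covariance matrix), and a sample is assigned to the class with the smallest discriminant score. The function names (`fit_dqda`, `dqda_score`, `predict_dqda`), the `eps` guard, and the toy expression values are assumptions introduced here; the feature selector and GRI steps are not shown.

```python
import math
from statistics import mean, variance


def fit_dqda(samples, labels, eps=1e-8):
    """Estimate per-class gene means, gene variances, and class priors.

    Each class needs at least two samples so a sample variance exists.
    """
    model = {}
    for cls in sorted(set(labels)):
        rows = [s for s, lab in zip(samples, labels) if lab == cls]
        genes = list(zip(*rows))  # transpose: one tuple of values per gene
        model[cls] = {
            "mean": [mean(g) for g in genes],
            # Sample variance per gene; eps (an assumption) avoids division by zero.
            "var": [variance(g) + eps for g in genes],
            "prior": len(rows) / len(samples),
        }
    return model


def dqda_score(params, x):
    """Diagonal quadratic discriminant score; smaller means a better fit."""
    score = -2.0 * math.log(params["prior"])
    for xj, mu, var in zip(x, params["mean"], params["var"]):
        score += (xj - mu) ** 2 / var + math.log(var)
    return score


def predict_dqda(model, x):
    """Assign x to the class with the smallest discriminant score."""
    return min(model, key=lambda cls: dqda_score(model[cls], x))


# Toy, made-up expression values for two genes (not taken from any dataset).
toy_samples = [[0.0, 0.1], [0.2, -0.1], [0.1, 0.0],
               [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
toy_labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

toy_model = fit_dqda(toy_samples, toy_labels)
print(predict_dqda(toy_model, [0.05, 0.0]))
print(predict_dqda(toy_model, [5.05, 5.0]))
```

Because the covariance is restricted to its diagonal, only one variance per gene and class has to be estimated, which keeps the classifier usable when genes far outnumber samples, as is typical for microarray data.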

Results

Conclusion