ALL/AML Cancer Classification by Gene Expression Data Using SVM and CSVM Approach

Xuegong Zhang, Haixin Ke

doi:10.11234/gi1990.11.237

Abstract

Cancer classification plays an important role in cancer treatment, yet no general approach to the problem exists. Cancer classification involves two tasks: identifying new cancer classes and assigning tumors to known classes, which Golub et al. [1] call class discovery and class prediction. From a mathematical point of view, class discovery is a cluster analysis problem, while class prediction is usually called a classification problem (we use the latter name to stay consistent with the pattern recognition literature). Until now, cancer classification has been based primarily on the morphological appearance of the tumor [1], which has serious limitations because of ambiguity. Golub et al. [1] presented a new approach to cancer classification based on gene expression monitoring by DNA microarrays. They chose acute leukemia as a test case, with the goal of distinguishing ALL (acute lymphoblastic leukemia) from AML (acute myeloid leukemia), a typical cancer classification problem that has not been well solved despite many years of effort. This paper reports our work on the classification (prediction) part of this problem, following their original work. Golub et al. applied a feature selection (gene selection) procedure before classification. A metric was defined to evaluate the correlation of each gene with the class distinction. After some "good" genes were selected from all 6817 genes, classification was done by a weighted voting scheme. The classifier was trained on a 38-sample training set, and a separate 34-sample set was used for testing. In leave-one-out cross-validation on the training set with 50 selected genes, 36 of the 38 samples were correctly classified and 2 were rejected (no-call). On the test set, 29 of the 34 samples were correctly classified and the other 5 were rejected. If the classifier were forced to give a prediction for these 5 no-calls, the predictions would be wrong.
Since the feature selection procedure scores genes one at a time, and the classification method is an intuitive one, we believe there is still much room for improvement in performance. In our approach to the problem, we used all the genes for classification (the selection problem will be discussed in another paper) and applied the support vector machine (SVM) method and one of its improved versions, CSVM, as the classifier. Thanks to the better generalization ability of SVM and CSVM, much better performance was obtained.
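
To make the baseline described above concrete, the following is a minimal sketch of a gene-ranking and weighted-voting classifier in the spirit of Golub et al. [1]. It is not the authors' code and does not cover the SVM or CSVM classifiers. The signal-to-noise form of the metric, the ranking by absolute score, the 0.3 no-call threshold, the function names, and the toy data are all assumptions made for illustration.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """Assumed per-gene metric: (mean_a - mean_b) / (sd_a + sd_b)."""
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))


def train(samples, labels, n_genes):
    """Score every gene on its own and keep the n_genes strongest ones.

    samples: list of expression vectors, labels: "ALL" or "AML" per sample.
    Returns a list of (gene_index, weight, midpoint) tuples.
    """
    params = []
    for g in range(len(samples[0])):
        a = [s[g] for s, y in zip(samples, labels) if y == "ALL"]
        b = [s[g] for s, y in zip(samples, labels) if y == "AML"]
        weight = signal_to_noise(a, b)        # positive weight favours ALL
        midpoint = (mean(a) + mean(b)) / 2    # decision boundary for this gene
        params.append((g, weight, midpoint))
    # Simplification (assumption): rank by absolute score.
    params.sort(key=lambda p: abs(p[1]), reverse=True)
    return params[:n_genes]


def predict(model, sample, threshold=0.3):
    """Each selected gene casts a weighted vote; weak margins give a no-call."""
    votes_all = votes_aml = 0.0
    for g, weight, midpoint in model:
        vote = weight * (sample[g] - midpoint)
        if vote > 0:
            votes_all += vote
        else:
            votes_aml += -vote
    total = votes_all + votes_aml
    if total == 0:
        return "no-call", 0.0
    strength = abs(votes_all - votes_aml) / total  # prediction strength in [0, 1]
    if strength < threshold:
        return "no-call", strength
    return ("ALL" if votes_all > votes_aml else "AML"), strength


# Toy data (hypothetical): 6 samples, 3 genes; gene 2 carries no class signal.
samples = [
    [5.0, 1.0, 3.0], [6.0, 1.2, 2.9], [5.5, 0.8, 3.1],   # ALL
    [1.0, 4.0, 3.0], [1.5, 4.5, 3.2], [0.8, 5.0, 2.8],   # AML
]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

model = train(samples, labels, n_genes=2)
print(predict(model, [5.2, 1.1, 3.0]))  # both selected genes point the same way
print(predict(model, [5.2, 4.4, 3.0]))  # the selected genes disagree
```

Because each gene is scored independently, two selected genes that disagree can cancel each other and produce a no-call, which illustrates the rejection behaviour reported for the voting scheme.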
