Classification of array CGH data using smoothed logistic regression model

Jian Huang, Kathleen O'Sullivan, Kaibin Lei, Yudi Pawitan, Agus Salim

doi:10.1002/sim.3753

Abstract

Array comparative genomic hybridization (aCGH) provides genome-wide information on DNA copy number that is potentially useful for disease classification. One immediate problem is that the data contain many features (probes) but only a few samples. Existing approaches to overcome this problem include feature selection, ridge regression and partial least squares. However, these methods typically ignore the spatial characteristics of aCGH data. To make explicit use of this spatial information, we develop a procedure called the smoothed logistic regression (SLR) model. The procedure is based on a mixed logistic regression model, where the random component is a mixture distribution that controls smoothness and sparseness. Conceptually such a procedure is straightforward, but its implementation is complicated by computational problems. We develop a fast and reliable iterative weighted least-squares algorithm based on the singular value decomposition. Simulated data and two real data sets are used to illustrate the procedure. For the real data sets, error rates are calculated using leave-one-out cross-validation. For both simulated and real data examples, SLR achieves lower misclassification error rates than previous methods.
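
To give a feel for the kind of computation the abstract describes, the following is a minimal sketch, not the authors' SLR algorithm. It assumes a simplified stand-in: a logistic regression whose probe coefficients are penalized by their squared first differences along the genome (plus a small ridge term), fitted by iteratively reweighted least squares and assessed by leave-one-out cross-validation. The mixture random-effect component and the SVD-based speed-up are not reproduced here, and the function names and the parameters `lam` and `ridge` are hypothetical.

```python
import numpy as np


def first_difference_penalty(p, ridge=0.1):
    """Penalty matrix D'D + ridge*I, where D takes first differences of
    neighbouring coefficients (an assumed stand-in for a smoothness prior)."""
    D = np.diff(np.eye(p), axis=0)  # shape (p-1, p)
    return D.T @ D + ridge * np.eye(p)


def fit_smoothed_logistic(X, y, lam=1.0, n_iter=50, tol=1e-8):
    """Penalized logistic regression fitted by iteratively reweighted
    least squares. Returns (intercept, coefficient vector)."""
    n, p = X.shape
    Xa = np.column_stack([np.ones(n), X])  # add intercept column
    Pa = np.zeros((p + 1, p + 1))
    Pa[1:, 1:] = lam * first_difference_penalty(p)  # intercept unpenalized
    theta = np.zeros(p + 1)
    for _ in range(n_iter):
        eta = Xa @ theta
        mu = 1.0 / (1.0 + np.exp(-eta))
        # IRLS weights, clipped away from zero for numerical stability
        w = np.clip(mu * (1.0 - mu), 1e-6, None)
        z = eta + (y - mu) / w  # working response
        A = Xa.T @ (w[:, None] * Xa) + Pa
        new_theta = np.linalg.solve(A, Xa.T @ (w * z))
        converged = np.max(np.abs(new_theta - theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta[0], theta[1:]


def loo_error(X, y, lam=1.0):
    """Leave-one-out cross-validated misclassification rate."""
    n = len(y)
    wrong = 0
    for i in range(n):
        keep = np.arange(n) != i
        b0, beta = fit_smoothed_logistic(X[keep], y[keep], lam=lam)
        pred = (b0 + X[i] @ beta) > 0
        wrong += int(pred != bool(y[i]))
    return wrong / n
```

Here `X` is assumed to be a samples-by-probes array with probes ordered by genomic position and `y` a float array of 0/1 class labels. The sketch solves a dense linear system whose size grows with the number of probes, so it is only practical for small examples; handling thousands of probes efficiently is the problem the paper's SVD-based algorithm is designed to address.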

Full Text