Abstract

This thesis addresses two important problems in modern statistics: discriminant analysis of big data and dimension reduction of high-dimensional data such as microarray gene expression data. These problems are commonly encountered in various scientific fields and can pose considerable challenges, since traditional approaches may perform poorly or even break down in the high-dimensional setting. For the first problem, discriminant analysis of big data, a widely used parametric approach is to model the distribution of the feature vector in each of the predefined classes via a normal mixture distribution. The component-covariance matrices in the normal mixture for a class are highly parameterized, which renders them impractical for high-dimensional datasets. Therefore, as the dimension increases, some form of regularization needs to be implemented. In this thesis, an innovative factor model approach, called mixtures of common factor analyzers for discriminant analysis (MCFDA), is proposed. With this approach, the component-covariance matrices are taken to have a factor-analytic form with common loadings across the classes (common before the transformation of the factors into white noise). This approach also allows the data to be viewed in low-dimensional spaces by plotting the (estimated) values of the latent factors corresponding to the observed data points. To improve the robustness of our MCFDA approach for data with heavy tails or atypical observations, we also adopt the multivariate t-family for the component-error and factor distributions. We refer to this model as the mixtures of common t-factor analyzers for discriminant analysis (MCtFDA). With this approach, both the common factor loadings and the diagonal matrix of error terms need to be specified as the same across the classes. This approach offers great flexibility for modelling data that are non-normal or contain outliers.
For the second problem, dimension reduction, we focus on the microarray setting, where the dimension is extremely large. Because of this extremely high dimensionality, traditional factor models and our proposed MCFDA model cannot be applied directly to the data; some form of dimension reduction needs to be undertaken before implementing the MCFDA procedure. We present a novel approach that incorporates dimension reduction into our MCFDA method. We first study various dimension reduction techniques and present a systematic classification of these approaches into two families, namely, screening and clustering. We then propose the MCFDA-screening and MCFDA-clustering approaches by adopting screening and clustering techniques, respectively. The performance of the new approaches is illustrated on a number of real datasets.
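
The parameter saving that motivates the common-loading factor-analytic form can be illustrated with a rough count. The sketch below is not taken from the thesis: it assumes that each component covariance has the form A Ω_i Aᵀ + D, with a p × q loading matrix A and a diagonal p × p matrix D shared across components and a q × q symmetric matrix Ω_i per component. The function names and the values of p, q and g are hypothetical, and the count ignores any identifiability constraints, so it should be read as a raw upper bound only.

```python
def full_covariance_params(p: int, g: int) -> int:
    """Free parameters in g unrestricted p x p symmetric covariance matrices."""
    return g * p * (p + 1) // 2


def common_factor_covariance_params(p: int, q: int, g: int) -> int:
    """Raw parameter count for covariances of the ASSUMED form A Omega_i A^T + D.

    A       : p x q loadings, shared by all components   -> p * q
    D       : diagonal p x p error covariance, shared    -> p
    Omega_i : q x q symmetric, one per component         -> g * q * (q + 1) / 2
    No adjustment is made for identifiability constraints.
    """
    return p * q + p + g * q * (q + 1) // 2


if __name__ == "__main__":
    # Hypothetical sizes: 1000 features, 5 latent factors, 3 components.
    p, q, g = 1000, 5, 3
    print("unrestricted:", full_covariance_params(p, g))
    print("common factor form:", common_factor_covariance_params(p, q, g))
```

Under these assumed sizes, the unrestricted count grows quadratically in p while the factor-analytic count grows only linearly, which is the sense in which such a form acts as regularization.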
