Abstract
Most proposed clustering approaches are heuristic in nature, which makes their clustering outcomes difficult to interpret from a statistical standpoint. Mixture model-based clustering has received much attention from the gene expression community because of its sound statistical background and its flexibility in data modeling. However, current clustering algorithms that follow the model-based framework suffer from two serious drawbacks. First, their performance depends critically on the starting values of their iterative clustering procedures. Second, they cannot work directly with very high dimensional data sets, whose dimension may reach the thousands. We propose a novel normalized Expectation-Maximization (EM) algorithm to tackle these two challenges. The normalized EM is stable even when its iterative EM procedure is initialized randomly. Its stability is demonstrated through a performance comparison with related clustering algorithms such as the unnormalized EM (the conventional EM algorithm for Gaussian mixture model-based clustering) and spherical k-means. Furthermore, the normalized EM is the first mixture model-based clustering algorithm shown to be stable when working directly with very high dimensional microarray data sets in the sample clustering problem, where the number of genes is much larger than the number of samples. We also uncover an interesting property of the convergence speed of the normalized EM with respect to the squared radius of the hypersphere in its corresponding statistical model.
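For orientation, the spherical k-means baseline named above can be sketched in a few lines of NumPy. This is not the normalized EM algorithm proposed here, whose details the abstract does not give. It is a minimal sketch under stated assumptions: that samples are rows of a matrix, that each sample is scaled onto a hypersphere before clustering, and that cluster centers are initialized from randomly chosen samples. The function names `normalize_rows` and `spherical_kmeans` and the parameters `radius`, `n_iter`, and `seed` are hypothetical, chosen only for illustration.

```python
import numpy as np

def normalize_rows(X, radius=1.0):
    """Scale each row (sample) so that it lies on a hypersphere of the given radius."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # assumption: all-zero rows are left unchanged
    return radius * X / norms

def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Illustrative spherical k-means: assign by cosine similarity, re-normalize centers."""
    rng = np.random.default_rng(seed)
    U = normalize_rows(X)  # unit-norm samples
    # Random initialization: pick k distinct samples as the starting centers.
    centers = U[rng.choice(len(U), size=k, replace=False)]
    for _ in range(n_iter):
        # On the unit sphere, the dot product equals the cosine similarity.
        labels = np.argmax(U @ centers.T, axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = U[labels == j]
            if len(members) > 0:  # an empty cluster keeps its previous center
                new_centers[j] = members.sum(axis=0)
        new_centers = normalize_rows(new_centers)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment against the centers that are returned.
    labels = np.argmax(U @ centers.T, axis=1)
    return labels, centers
```

In the sample clustering setting described above, `X` would have one row per sample and one column per gene, so the number of columns can far exceed the number of rows. Because the starting centers are drawn at random, different values of `seed` may give different partitions, which is the kind of sensitivity to initialization that the abstract discusses.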