This paper introduces a novel methodology for multiple mean comparison of clusters identified in gene expression data through the t-distributed Stochastic Neighbor Embedding (t-SNE) plot, a powerful dimensionality reduction technique for visualizing high-dimensional gene expression data. Our approach integrates t-SNE visualization with rigorous statistical testing to validate the differences between identified clusters, bridging the gap between exploratory and confirmatory data analysis. We applied the methodology to two real-world gene expression datasets for which the t-SNE plots clearly separated clusters corresponding to different expression levels. Our findings underscore the value of combining t-SNE visualization with multiple mean comparison in gene expression analysis: the integrated approach enhances the interpretability of complex data and provides a robust statistical framework for validating observed patterns. The classical MANOVA method can be applied to the same multiple mean comparison, but it requires the total sample size to exceed the data dimension and mostly relies on an asymptotic null distribution. The approach proposed in this paper applies to high-dimensional data with small sample sizes, and its test statistic has an exact null distribution.

Objective: To propose a two-step approach to the analysis of gene expression data. Gene expression data usually have a complicated nonlinear structure that cannot be visualized with simple linear dimension reduction such as principal component analysis (PCA). We propose first applying the existing t-SNE approach to dimension reduction so that clusters in the data can be clearly visualized, and then applying multiple mean comparison methods to carry out statistical inference. We propose the PCA-type projected exact F-test for multiple mean comparison among the clusters.
The projected F-test is superior to the classical MANOVA method when the dimension is high and the number of clusters is relatively large.

Results: A simple Monte Carlo study comparing the projected F-test with the classical MANOVA Wilks’ Lambda test, together with an illustration on two real datasets, shows that the projected F-test has better empirical power than the classical Wilks’ Lambda test. Applying the t-SNE plot to real gene expression data reveals a clear cluster structure. The projected F-test further enhances the interpretability of the t-SNE plot by validating the significant differences among the visualized clusters.

Conclusion: Our findings suggest that combining t-SNE visualization with multiple mean comparison through the PCA-projected exact F-test is a valuable tool for gene expression analysis. It not only enhances the interpretability of high-dimensional data but also provides a rigorous statistical framework for validating the observed patterns.
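
The two-step workflow described above (t-SNE for visualization, then a mean comparison among the visualized clusters) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' procedure: the abstract does not give the construction of the PCA-projected exact F-test, so the sketch substitutes an ordinary one-way ANOVA F-test on the data projected onto its first principal component. The use of k-means to label clusters in the embedding, the function name `tsne_then_projected_anova`, and the synthetic data are likewise hypothetical choices made for the example.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def tsne_then_projected_anova(X, n_clusters, seed=0):
    """Embed with t-SNE, label clusters in the embedding, then F-test a 1-D projection.

    Stand-in only: the final step is a plain one-way ANOVA on the first
    principal component, NOT the paper's PCA-projected exact F-test.
    """
    # Step 1: nonlinear 2-D embedding used for visualization.
    # Perplexity must stay below the number of samples.
    perplexity = min(30.0, (X.shape[0] - 1) / 3.0)
    embedding = TSNE(
        n_components=2, perplexity=perplexity, random_state=seed
    ).fit_transform(X)

    # Step 2: label the clusters seen in the embedding (k-means is an assumption).
    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(embedding)

    # Step 3: project the original high-dimensional data onto one direction
    # and compare the cluster means of the projected scores.
    scores = PCA(n_components=1).fit_transform(X).ravel()
    groups = [scores[labels == k] for k in range(n_clusters)]
    f_stat, p_value = f_oneway(*groups)
    return embedding, labels, f_stat, p_value


# Synthetic stand-in for expression data: 45 samples, 200 features, so the
# sample size is smaller than the dimension.
rng = np.random.default_rng(0)
n_per_group, n_features = 15, 200
X = np.vstack(
    [rng.normal(shift, 1.0, size=(n_per_group, n_features)) for shift in (0.0, 3.0, 6.0)]
)

embedding, labels, f_stat, p_value = tsne_then_projected_anova(X, n_clusters=3)
print(f"F = {f_stat:.1f}, p = {p_value:.3g}")
```

Projecting to a single dimension before testing keeps the F-test well defined even when the number of features exceeds the number of samples, which is the setting in which the abstract says classical MANOVA is not applicable.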