Abstract

Non-normality is common in gene expression data, so flexible models are needed to account for the underlying asymmetry and heavy tails of multivariate gene expression measures. This paper addresses the issue by exploring the projection pursuit problem under a flexible framework in which the underlying model is assumed to follow a multivariate skew-t distribution. Under this assumption, projection pursuit with skewness and kurtosis indices is treated as a natural approach to data reduction. The work examines the properties of this approach, giving some theoretical insights and delving into the computational side with regard to its application to real gene expression data. The theoretical results are illustrated by means of a simulation study; the simulation outputs are combined with the theoretical insights to shed light on the usefulness of skewness-kurtosis projection pursuit for summarizing multivariate gene expression data. The application to gene expression measures of patients diagnosed with triple-negative breast cancer gives promising findings that may help to explain the heterogeneity of this type of tumor.

Highlights

  • The development of high-throughput technologies has made it possible to monitor the expression levels of hundreds of genes simultaneously in an attempt to gain insight into the molecular mechanisms of human diseases

  • The results of our analysis reveal the limitations of the normal distribution for modeling multivariate gene expression data

  • The BIC led to a four-group model for the skewness-kurtosis projection pursuit (PP) gene features, whereas it led to a three-group model when principal component analysis (PCA) was used to summarize the gene expression measures

Summary

Introduction

The development of high-throughput technologies has made it possible to monitor the expression levels of hundreds of genes simultaneously in an attempt to gain insight into the molecular mechanisms of human diseases. Since gene expression measures usually exhibit asymmetries and heavy tails, the normality assumption is not realistic [1,2,3,4], and dimension reduction methods based on first- and second-order moments entail obvious theoretical limitations.

[…] $\omega = \operatorname{diag}(\omega_1, \ldots, \omega_p)$ with non-negative entries, which converts the correlation matrix $\bar{\Omega}$ into a scale matrix $\Omega = \omega \bar{\Omega} \omega$; as a result, the vector $X = \xi + \omega Z$ has a skew-normal (SN) distribution with density function

$$f(x; \xi, \Omega, \alpha) = 2\,\phi_p(x - \xi; \Omega)\,\Phi\!\left(\alpha^\top \omega^{-1}(x - \xi)\right), \quad x \in \mathbb{R}^p.$$

The parameters in the density above are the location $\xi$, the scale matrix $\Omega$ and the shape vector $\alpha$, or $\eta = \omega^{-1}\alpha$, which regulates the multivariate asymmetry of the model.
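To make the SN density above concrete, the following is a minimal sketch in plain Python that evaluates it at a point. It rests on assumptions not stated in the excerpt: φ_p is taken to be the p-variate normal density with zero mean and covariance Ω, Φ the standard normal distribution function, and ω the diagonal matrix holding the square roots of the diagonal of Ω. The function names (`cholesky`, `mvn_pdf`, `std_normal_cdf`, `sn_pdf`) are illustrative and do not come from the paper, and the sketch covers only the skew-normal case, not the skew-t model.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def mvn_pdf(d, omega):
    """Density of N_p(0, omega) evaluated at the centred point d."""
    p = len(d)
    low = cholesky(omega)
    # Forward substitution: solve low * y = d, so that y'y = d' omega^{-1} d.
    y = []
    for i in range(p):
        y.append((d[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i])
    quad = sum(v * v for v in y)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(p))
    return math.exp(-0.5 * (p * math.log(2.0 * math.pi) + log_det + quad))


def std_normal_cdf(t):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def sn_pdf(x, xi, omega, alpha):
    """Skew-normal density 2 * phi_p(x - xi; omega) * Phi(alpha' w^{-1} (x - xi)).

    Assumption: w = diag(sqrt(omega_11), ..., sqrt(omega_pp)).
    """
    d = [xj - mj for xj, mj in zip(x, xi)]
    scales = [math.sqrt(omega[i][i]) for i in range(len(x))]
    t = sum(a * dj / s for a, dj, s in zip(alpha, d, scales))
    return 2.0 * mvn_pdf(d, omega) * std_normal_cdf(t)


if __name__ == "__main__":
    xi = [1.0, -1.0]
    omega = [[4.0, 1.0], [1.0, 2.0]]
    alpha = [2.0, -3.0]
    print(sn_pdf([2.5, 0.3], xi, omega, alpha))
```

Setting the shape vector to zero makes the factor Φ(·) equal to 1/2, so the sketch reduces to the ordinary normal density; this is a quick way to check that α alone drives the asymmetry.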
