Abstract

Quality control, global biases, normalization, and analysis methods for RNA-Seq data are quite different from those for microarray-based studies. The assumption of normality is reasonable for microarray-based gene expression data; however, RNA-Seq data tend to follow an over-dispersed Poisson or negative binomial distribution. Little research has been done to assess how data transformations affect Gaussian model-based clustering of RNA-Seq data, with respect to both clustering performance and accuracy in estimating the correct number of clusters. In this article, we investigate both by applying four data transformations (naïve, logarithmic, Blom, and variance stabilizing transformation) to simulated RNA-Seq data. To do so, an extensive simulation study was carried out in which the scenarios varied in how genes were selected for inclusion in the clustering analyses, the size of the clusters, and the number of clusters. Following the application of the different transformations to the simulated data, Gaussian model-based clustering was carried out. To assess clustering performance for each of the data transformations, the adjusted Rand index, clustering error rate, and concordance index were used. As expected, our results showed that clustering performance improved in scenarios where data transformations were applied to make the data appear “more” Gaussian in distribution.
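
As a rough illustration of three of the four transformations named above, the following Python sketch applies them to one gene's counts across samples. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the pseudocount of 1, the base-2 logarithm, and the example counts are all hypothetical, and the Blom formula used is the standard rank-based one with constant 3/8. The variance stabilizing transformation is left out because it depends on a fitted mean-dispersion relationship that this summary does not describe.

```python
import math
from statistics import NormalDist


def naive(counts):
    """Naive transformation: leave the counts unchanged."""
    return [float(c) for c in counts]


def log_transform(counts, pseudocount=1.0):
    """log2(count + pseudocount); the pseudocount (an assumption) avoids log(0)."""
    return [math.log2(c + pseudocount) for c in counts]


def blom(counts, c=3.0 / 8.0):
    """Blom rank-based inverse normal transformation.

    Each value is replaced by Phi^-1((r - c) / (n - 2c + 1)), where r is its
    rank (ties receive the average rank) and n is the number of values.
    """
    n = len(counts)
    order = sorted(range(n), key=lambda i: counts[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and counts[order[j + 1]] == counts[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]


# Hypothetical counts for a single gene across seven samples.
gene_counts = [0, 3, 3, 12, 57, 240, 1015]
print(log_transform(gene_counts))
print(blom(gene_counts))
```

The Blom transformation uses only the ranks of the counts, so it maps any skewed count distribution onto approximately standard normal scores, whereas the logarithm compresses large counts but keeps their relative spacing.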

Highlights

  • The analysis of RNA-Seq data poses different and additional challenges compared with microarray-based data

  • The current literature contains three closely related studies of clustering performance for sequence data: the first investigated clustering of sequencing data using a Poisson log-linear model [16]; the second examined the consistency of results from differential expression and clustering analyses between the two technologies for assessing mRNA (microarray and sequencing) using a variety of statistical methods [17]; and the third provided a model-based clustering framework for determining groups or sets of differentially expressed genes using RNA-Seq data [18]

  • To study which RNA-Seq data transformations improve model-based clustering performance, “real-life” data parameters were acquired from the 55 high-grade serous histology tumor samples in batch 1; these samples were selected because of the histology's commonness, its aggressive nature, and the uncertainty surrounding the number of potential subtypes present within it, which ranges from two to five [4, 5, 6, 22, 23]


Summary

Introduction

The analysis of RNA-Seq data poses different and additional challenges compared with microarray-based data. The current literature contains three closely related studies of clustering performance for sequence data: the first investigated clustering of sequencing data using a Poisson log-linear model [16]; the second examined the consistency of results from differential expression and clustering analyses between the two technologies for assessing mRNA (microarray and sequencing) using a variety of statistical methods [17]; and the third provided a model-based clustering framework for determining groups or sets of differentially expressed genes using RNA-Seq data [18]. We set out to evaluate how the commonly used Gaussian model-based clustering method performs when applied to RNA-Seq data after a variety of data transformations, with the ultimate goal of clustering subjects/individuals into distinct molecular subgroups.
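
Evaluating how well a clustering recovers known subgroups needs an agreement measure, and the abstract names the adjusted Rand index as one of those used. The following Python sketch computes it from two label vectors using the standard pair-counting formula. It is an illustrative sketch only: the function name and the example labels are hypothetical, and it is not taken from the authors' code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same subjects.

    1.0 means identical partitions (up to relabeling); values near 0 are
    what random assignment would give; negative values are possible.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency table cells and its row/column totals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in rows.values())
    sum_cols = sum(comb(v, 2) for v in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2.0
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical true subgroups and the labels assigned by a clustering method.
true_groups = [0, 0, 0, 1, 1, 1]
assigned = [0, 0, 1, 1, 1, 1]
print(adjusted_rand_index(true_groups, assigned))
```

Because the index is corrected for chance and ignores label names, it can compare clusterings obtained under different transformations even when the estimated number of clusters differs from the true number.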

