Abstract

Sample- and gene-based hierarchical cluster analyses have been widely adopted as tools for exploring gene expression data in high-throughput experiments. Gene expression values (read counts) generated by RNA sequencing technology (RNA-seq) are discrete variables with special statistical properties, such as over-dispersion and right-skewness. Additionally, read counts are subject to technology artifacts such as differences in sequencing depth. This poses a challenge to finding distance measures suitable for hierarchical clustering. Normalization and transformation procedures have been proposed to favor the use of Euclidean and correlation-based distances. Novel model-based dissimilarities that account for RNA-seq data characteristics have also been proposed. Adequacy of dissimilarity measures has been assessed using parametric simulations or exemplar datasets, which may limit the scope of the conclusions. Here, we propose the simulation of realistic conditions through creation of plasmode datasets to assess the adequacy of dissimilarity measures for sample-based hierarchical clustering of RNA-seq data. Consistent results were obtained using plasmode datasets based on RNA-seq experiments conducted under widely different conditions. Dissimilarity measures based on Euclidean distance that only considered data normalization or data standardization did not reliably represent the expected hierarchical structure. Conversely, using a Poisson-based dissimilarity, a rank-correlation-based dissimilarity, or an appropriate data transformation resulted in dendrograms that resembled the expected hierarchical structure. Plasmode datasets can be generated for a wide range of scenarios upon which dissimilarity measures can be evaluated for sample-based hierarchical clustering analysis. We showed different ways of generating such plasmodes and applied them to the problem of selecting a suitable dissimilarity measure. We report several satisfactory measures; the choice of a particular measure may depend on its availability in the preferred software pipeline.

Highlights

  • Hierarchical cluster analysis has been a popular method for finding patterns in data and for representing results of gene expression analysis [1]

  • We propose the use of plasmode datasets to assess the properties of dissimilarity measures for agglomerative hierarchical clustering of RNA sequencing technology (RNA-seq) data

  • The dendrogram based on Euclidean distance calculated from raw normalized data (Fig 3b) mixed treatment labels and did not recover any expected structure


Summary

Introduction

Hierarchical cluster analysis has been a popular method for finding patterns in data and for representing results of gene expression analysis [1]. Clustering algorithms have been widely studied for analyzing microarray data [2,3], but that technology is being rapidly replaced by RNA sequencing technology (RNA-seq) [4]. In contrast to microarray experiments, RNA-seq generates discrete count data that may call for different analysis methods. Before implementing any statistical analysis of RNA-seq data, normalization and transformation have to be performed. Data transformation can be very important because it reduces the effects of skewness, scale, and outliers in read count data, which usually follow a Poisson [7] or negative binomial distribution [8,9]. Dissimilarity measures that are sensitive to asymmetric distributions and scale magnitude, such as Euclidean and 1 − Pearson correlation [1,2,10], could be used for clustering RNA-seq data
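To make the sensitivity to scale concrete, the following is a minimal sketch rather than the authors' procedure. It compares Euclidean distance on raw counts, Euclidean distance after a depth normalization and log transformation, and a rank-correlation dissimilarity (1 − Spearman correlation). The three toy samples, the helper names, and the choice of counts-per-million scaling with a log2(x + 1) transformation are all assumptions made for illustration. Sample B has the same expression profile as sample A at ten times the sequencing depth, while sample C has a different profile at a depth close to A's.

```python
import math

def cpm(counts):
    """Scale one sample to counts per million (assumed normalization)."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]

def log2p1(values):
    """log2(x + 1) transformation (assumed; reduces right-skewness)."""
    return [math.log2(v + 1.0) for v in values]

def euclidean(x, y):
    """Euclidean distance between two samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_dissimilarity(x, y):
    """1 - Spearman correlation: Pearson correlation of the ranks."""
    return 1.0 - pearson(ranks(x), ranks(y))

# Hypothetical read counts for five genes in three samples.
sample_a = [10, 200, 30, 0, 5]
sample_b = [100, 2000, 300, 0, 50]   # same profile as A, 10x the depth
sample_c = [200, 10, 5, 30, 0]       # different profile, depth close to A

raw_ab = euclidean(sample_a, sample_b)
raw_ac = euclidean(sample_a, sample_c)
norm_ab = euclidean(log2p1(cpm(sample_a)), log2p1(cpm(sample_b)))
norm_ac = euclidean(log2p1(cpm(sample_a)), log2p1(cpm(sample_c)))
rank_ab = spearman_dissimilarity(sample_a, sample_b)
rank_ac = spearman_dissimilarity(sample_a, sample_c)

print("raw Euclidean:       A-B", raw_ab, " A-C", raw_ac)
print("CPM + log Euclidean: A-B", norm_ab, " A-C", norm_ac)
print("1 - Spearman:        A-B", rank_ab, " A-C", rank_ac)
```

On the raw counts, A ends up farther from B than from C purely because of sequencing depth. After scaling and transformation, or when only ranks are used, A and B become essentially indistinguishable. Any of these pairwise values could then populate the dissimilarity matrix given to an agglomerative clustering routine.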
