Abstract

BackgroundHigh-dimensional data of discrete and skewed nature is commonly encountered in high-throughput sequencing studies. Analyzing the network itself or the interplay between genes in this type of data continues to present many challenges. As data visualization techniques become cumbersome for higher dimensions and unconvincing when there is no clear separation between homogeneous subgroups within the data, cluster analysis provides an intuitive alternative. The aim of applying mixture model-based clustering in this context is to discover groups of co-expressed genes, which can shed light on biological functions and pathways of gene products.ResultsA mixture of multivariate Poisson-log normal (MPLN) model is developed for clustering of high-throughput transcriptome sequencing data. Parameter estimation is carried out using a Markov chain Monte Carlo expectation-maximization (MCMC-EM) algorithm, and information criteria are used for model selection.ConclusionsThe mixture of MPLN model is able to fit a wide range of correlation and overdispersion situations, and is suited for modeling multivariate count data from RNA sequencing studies. All scripts used for implementing the method can be found at https://github.com/anjalisilva/MPLNClust.

Highlights

  • High-dimensional data of discrete and skewed nature is commonly encountered in high-throughput sequencing studies

  • The mixture model-based clustering method based on multivariate Poisson-log normal (MPLN) distributions is an excellent tool for analysis of RNA sequencing (RNA-seq) data

  • The MPLN distribution is able to describe a wide range of correlation and overdispersion situations, and is ideal for modeling RNA-seq data, which is generally overdispersed

Read more

Summary

Introduction

High-dimensional data of discrete and skewed nature is commonly encountered in high-throughput sequencing studies. RNA sequencing (RNA-seq) is used to determine the transcriptional dynamics of a biological system by measuring the expression levels of thousands of genes simultaneously [1, 2]. This technique provides counts of reads that can be mapped back to a biological entity, such as a gene or an exon, which is a measure of the gene’s expression under experimental conditions. Analyzing RNA-seq data is challenged by several factors, including the nature of the data, which is characterized by high dimensionality, skewness, and presence of a dynamic range that may vary from zero to over a million counts. Cluster analysis is often performed as part of downstream analysis to identify key features between observations

Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call