RNA-Seq has emerged as a powerful technique for transcriptome studies. Along with improved sensitivity and coverage, RNA-Seq also brings challenges for data analysis. The massive volume of sequence read data, excessive variability, uncertainty, and bias and noise stemming from multiple sources all make RNA-Seq data difficult to analyze. Despite much progress, RNA-Seq data analysis still has considerable room for improvement, especially in the quantification of transcript/gene expression levels. In this article, using finite Poisson mixture models, we propose a two-step approach, called PM-Seq, for characterizing base-pair-level RNA-Seq data and quantifying transcript/gene expression levels. Finite Poisson mixture models combine the strength of fully parametric models with the flexibility of fully nonparametric models, and are well suited to modeling heterogeneous count data such as RNA-Seq data. In particular, we consider three types of Poisson mixture models and propose a BIC-based model selection procedure to adapt the models to individual transcripts. A unified quantification method based on the Poisson mixture models is developed to measure transcript/gene expression levels. We applied the Poisson mixture models and the proposed quantification method to two RNA-Seq data sets, where they performed very well in comparison with existing methods. Our approach resulted in better characterization of the data and more accurate measurements of transcript expression levels. We believe that finite Poisson mixture models provide a flexible framework for modeling RNA-Seq data, and that methods developed within this framework have the potential to become powerful tools for RNA-Seq data analysis.
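To make the idea of BIC-based selection among finite Poisson mixtures concrete, here is a minimal sketch. It is not the PM-Seq implementation: the abstract does not specify the three model types, so the code assumes a generic K-component Poisson mixture fitted by EM to per-base read counts of a single transcript, and chooses K by minimizing BIC with 2K - 1 free parameters. All function names (`fit_poisson_mixture`, `select_by_bic`, and so on) and the simulated counts are hypothetical.

```python
import numpy as np


def log_joint(y, weights, rates):
    """Return an (n, k) array of log[w_j * Poisson(y_i | rate_j)]."""
    # log(y!) for non-negative integer counts via a cumulative sum of logs
    log_fact = np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, y.max() + 1)))))
    return np.log(weights) + y[:, None] * np.log(rates) - rates - log_fact[y][:, None]


def log_likelihood(y, weights, rates):
    """Mixture log-likelihood, using a stable log-sum-exp over components."""
    lj = log_joint(y, weights, rates)
    m = lj.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(lj - m).sum(axis=1))).sum())


def fit_poisson_mixture(counts, k, n_iter=500, tol=1e-8):
    """Fit a k-component Poisson mixture by EM; return (weights, rates, loglik)."""
    y = np.asarray(counts, dtype=np.int64)
    # Start rates at spread-out quantiles; small offsets keep them positive and distinct
    rates = np.quantile(y, (np.arange(k) + 0.5) / k) + 0.5 + 0.01 * np.arange(k)
    weights = np.full(k, 1.0 / k)
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability that each position belongs to each component
        lj = log_joint(y, weights, rates)
        resp = np.exp(lj - lj.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixing weights and component rates (floored to stay positive)
        nk = np.maximum(resp.sum(axis=0), 1e-12)
        weights = nk / nk.sum()
        rates = np.maximum(resp.T @ y / nk, 1e-12)
        cur = log_likelihood(y, weights, rates)
        if cur - prev < tol:
            break
        prev = cur
    return weights, rates, cur


def bic(loglik, k, n):
    """BIC with k rates and k - 1 free mixing weights."""
    return -2.0 * loglik + (2 * k - 1) * np.log(n)


def select_by_bic(counts, max_k=4):
    """Fit mixtures with 1..max_k components and return the BIC-minimizing one."""
    fits = {k: fit_poisson_mixture(counts, k) for k in range(1, max_k + 1)}
    scores = {k: bic(fit[2], k, len(counts)) for k, fit in fits.items()}
    best_k = min(scores, key=scores.get)
    return best_k, fits[best_k], scores


# Simulated per-base counts for one hypothetical transcript (assumed data, not from the paper)
rng = np.random.default_rng(0)
depth = np.concatenate([rng.poisson(5, 600), rng.poisson(40, 400)])
best_k, (w, lam, _), scores = select_by_bic(depth)
print(best_k, np.round(w, 2), np.round(lam, 1))
```

Running the selection separately for each transcript mirrors the abstract's idea of adapting the model to individual transcripts; how the fitted mixture is then turned into an expression measure is part of the authors' quantification method and is not reproduced here.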