Abstract
RNA sequencing (RNA-Seq) is a powerful technique for the gene-expression profiling of organisms that uses the capabilities of next-generation sequencing technologies. Gene-expression-based classification is an emerging and powerful approach to diagnosis, disease classification and monitoring at the molecular level, and it can also provide potential disease markers. Most of the statistical methods proposed for the classification of gene-expression data either assume a continuous scale (e.g., microarray data) or require a normal distribution. Hence, these methods cannot be applied directly to RNA-Seq data, because count data violate both their data-structure and their distributional assumptions. However, it is possible to apply these algorithms to RNA-Seq data with appropriate modifications. One way is to develop count-based classifiers, such as Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA). Another way is to bring the data closer to microarrays and apply microarray-based classifiers. In this study, we compared several classifiers including PLDA with and without power transformation, NBLDA, single support vector machines (SVM), bagging SVM (bagSVM), classification and regression trees (CART), and random forests (RF). We also examined the effect of several parameters such as overdispersion, sample size, number of genes, number of classes, differential-expression rate, and the transformation method on model performance. A comprehensive simulation study was conducted, and its results were compared with those obtained from two miRNA and two mRNA experimental datasets. The results revealed that increasing the sample size and the differential-expression rate, and decreasing the dispersion parameter and the number of groups, lead to an increase in classification accuracy. As in differential-expression studies, the classification of RNA-Seq data requires careful attention when handling data overdispersion.
We conclude that the power-transformed PLDA, as a count-based classifier, and the vst- or rlog-transformed RF and SVM, as microarray-based classifiers, may be good choices for classification. An R/Bioconductor package, MLSeq, is freely available at https://www.bioconductor.org/packages/release/bioc/html/MLSeq.html.
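To make the idea of a count-based classifier concrete, the following is a minimal sketch of a Poisson likelihood classifier in plain Python. It is not the PLDA implementation evaluated in the study or shipped in MLSeq: it assumes independent Poisson counts per gene, ignores sequencing-depth (size-factor) normalization and any shrinkage of class effects, and the function names, the `pseudo` pseudocount and the toy counts are hypothetical choices made for illustration only.

```python
import math

def fit_poisson_classifier(X, y, pseudo=1.0):
    """Estimate a per-class Poisson mean for every gene.

    X: list of samples, each a list of non-negative counts (one per gene).
    y: list of class labels, same length as X.
    pseudo: hypothetical pseudocount that keeps every mean strictly positive.
    """
    n_genes = len(X[0])
    model = {}
    for k in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == k]
        means = [(sum(r[j] for r in rows) + pseudo) / len(rows)
                 for j in range(n_genes)]
        model[k] = {"log_prior": math.log(len(rows) / len(X)), "means": means}
    return model

def predict(model, x):
    """Return the class with the highest Poisson log-likelihood plus log prior."""
    def score(k):
        means = model[k]["means"]
        # Poisson log-pmf without the log(x!) term, which is shared by all classes.
        return model[k]["log_prior"] + sum(
            xj * math.log(mj) - mj for xj, mj in zip(x, means))
    return max(model, key=score)

# Hypothetical toy data: four samples, three genes, two classes.
X_train = [[50, 5, 20], [60, 4, 22], [5, 55, 21], [6, 48, 19]]
y_train = ["A", "A", "B", "B"]
model = fit_poisson_classifier(X_train, y_train)

print(predict(model, [55, 6, 20]))
print(predict(model, [4, 50, 20]))
```

Because the Poisson model forces the variance to equal the mean, a sketch like this is expected to degrade on overdispersed data, which is the motivation the abstract gives for the power transformation and for NBLDA.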
Highlights
Transcriptome sequencing (RNA-Seq), with the advent of high-throughput next-generation sequencing (NGS) technologies, has become a popular experimental approach for generating a comprehensive catalog of protein-coding genes and non-coding ribonucleic acids (RNAs) and for examining the transcriptional activity of genomes.
RNA sequencing (RNA-Seq) is a promising tool with a remarkably wide range of applications, including (i) discovering novel transcripts, (ii) detecting and quantifying spliced isoforms, (iii) fusion detection and (iv) revealing sequence variations (e.g., SNPs, indels) [1]. Beyond these common applications, RNA-Seq can be a method of choice for gene-expression-based classification: identifying significant transcripts, distinguishing biological samples and predicting outcomes from large-scale gene-expression data, which can be generated in a single run. This classification is widely used in medicine for diagnostic purposes and refers to the detection of a small subset of genes that achieves the maximal predictive performance.
The figure shows that the cervical and Alzheimer miRNA datasets are very highly overdispersed (φ > 1), while the lung and renal cell cancer datasets are substantially overdispersed.
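As a rough illustration of what a dispersion value φ means for a single gene, the sketch below computes a method-of-moments estimate from raw counts. It assumes the common negative binomial parameterization Var = μ + φμ², which the excerpt above does not state explicitly, and it is not the estimator used in the study. The function name and the read counts are hypothetical.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments estimate of phi, assuming Var = mu + phi * mu**2."""
    mu = mean(counts)
    if mu == 0:
        return 0.0
    s2 = variance(counts)  # sample variance (n - 1 denominator)
    # Clamp at zero: a gene can show variance below its mean by chance.
    return max(0.0, (s2 - mu) / mu ** 2)

# Hypothetical read counts for two genes across six samples.
gene_a = [10, 12, 9, 11, 10, 8]   # variance below the mean
gene_b = [2, 40, 0, 15, 90, 3]    # variance far above the mean

for name, counts in (("gene_a", gene_a), ("gene_b", gene_b)):
    print(name, round(moment_dispersion(counts), 3))
```

Under this assumed parameterization, `gene_a` gives an estimate of zero (no evidence of overdispersion), while `gene_b` gives an estimate above 1, the range the text describes as very highly overdispersed.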
Summary
Transcriptome sequencing (RNA-Seq), with the advent of high-throughput NGS technologies, has become a popular experimental approach for generating a comprehensive catalog of protein-coding genes and non-coding RNAs and for examining the transcriptional activity of genomes. RNA-Seq is a promising tool with a remarkably wide range of applications, including (i) discovering novel transcripts, (ii) detecting and quantifying spliced isoforms, (iii) fusion detection and (iv) revealing sequence variations (e.g., SNPs, indels) [1]. Beyond these common applications, RNA-Seq can be a method of choice for gene-expression-based classification: identifying significant transcripts, distinguishing biological samples and predicting outcomes from large-scale gene-expression data, which can be generated in a single run. This classification is widely used in medicine for diagnostic purposes and refers to the detection of a small subset of genes that achieves the maximal predictive performance. Various studies have been conducted to deal with the overdispersion problem in the differential-expression (DE) analysis of RNA-Seq data [8,9,10,11,12].