A comprehensive simulation study on classification of RNA-Seq data.

Gökmen Zararsız, Vahap Eldem, Ahmet Ozturk, Izzet Parug Duru, Dincer Goksuluk, Selcuk Korkmaz, Gozde Erturk Zararsiz

doi:10.1371/journal.pone.0182507


Open Access

https://doi.org/10.1371/journal.pone.0182507


Journal: PloS one	Publication Date: Aug 23, 2017
Citations: 30	License type: CC BY 4.0

Affiliation: Erciyes University, Istanbul University, Marmara University

Abstract

RNA sequencing (RNA-Seq) is a powerful technique for the gene-expression profiling of organisms that uses the capabilities of next-generation sequencing technologies. Developing gene-expression-based classification algorithms is an emerging and powerful approach for diagnosis, disease classification and monitoring at the molecular level, as well as for providing potential markers of diseases. Most of the statistical methods proposed for the classification of gene-expression data are either based on a continuous scale (e.g., microarray data) or require a normal distribution assumption. Hence, these methods cannot be directly applied to RNA-Seq data, because RNA-Seq counts violate both the data-structure and the distributional assumptions of these methods. However, it is possible to apply these algorithms to RNA-Seq data with appropriate modifications. One way is to develop count-based classifiers, such as Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA). Another way is to bring the data closer to microarrays and apply microarray-based classifiers. In this study, we compared several classifiers including PLDA with and without power transformation, NBLDA, single SVM, bagging SVM (bagSVM), classification and regression trees (CART), and random forests (RF). We also examined the effect of several parameters such as overdispersion, sample size, number of genes, number of classes, differential-expression rate, and the transformation method on model performances. A comprehensive simulation study was conducted and the results were compared with the results of two miRNA and two mRNA experimental datasets. The results revealed that increasing the sample size and differential-expression rate, and decreasing the dispersion parameter and number of groups, lead to an increase in classification accuracy. As in differential-expression studies, the classification of RNA-Seq data requires careful attention when handling data overdispersion.
We conclude that, as a count-based classifier, the power transformed PLDA and, as microarray-based classifiers, vst or rlog transformed RF and SVM classifiers may be good choices for classification. An R/Bioconductor package, MLSeq, is freely available at https://www.bioconductor.org/packages/release/bioc/html/MLSeq.html.
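
To make the idea of a count-based classifier concrete, the following is a minimal sketch of a Poisson linear discriminant rule in Python with NumPy. It is not the authors' implementation and not the MLSeq API. It assumes the commonly used formulation in which the count for sample i and gene j in class k is Poisson with mean s_i * g_j * d_kj, with total-count size factors s_i, gene totals g_j, and class effects d_kj smoothed by a hypothetical constant `beta`. The function names `fit_plda` and `predict_plda` are illustrative, and the power transformation and any shrinkage of d_kj are left out.

```python
import numpy as np

def fit_plda(X, y, beta=1.0):
    """Fit a simple Poisson discriminant model.

    X: (n_samples, n_genes) matrix of raw counts.
    y: class label per sample.
    beta: smoothing constant (an assumption of this sketch) that keeps
          the class effects d_kj strictly positive.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    total = X.sum()
    s = X.sum(axis=1) / total      # total-count size factor per sample
    g = X.sum(axis=0)              # total count per gene
    N = np.outer(s, g)             # expected counts with no class effect
    # Class effect: observed class total over expected class total, per gene.
    d = np.vstack([
        (X[y == k].sum(axis=0) + beta) / (N[y == k].sum(axis=0) + beta)
        for k in classes
    ])
    priors = np.array([(y == k).mean() for k in classes])
    return {"classes": classes, "g": g, "d": d,
            "priors": priors, "total": total}

def predict_plda(model, X_new):
    """Assign each new sample to the class with the largest Poisson score."""
    X_new = np.atleast_2d(np.asarray(X_new, dtype=float))
    # Size factor of a new sample, relative to the training total.
    s_new = X_new.sum(axis=1) / model["total"]
    log_d = np.log(model["d"])                       # (n_classes, n_genes)
    # score_k = sum_j x_j log d_kj - s_new * sum_j g_j d_kj + log prior_k
    scores = (X_new @ log_d.T
              - np.outer(s_new, model["d"] @ model["g"])
              + np.log(model["priors"]))
    return model["classes"][np.argmax(scores, axis=1)]

if __name__ == "__main__":
    # Toy counts: gene 0 is high in class 0, gene 1 is high in class 1.
    X = [[100, 10, 50], [90, 12, 55], [10, 100, 50], [12, 95, 48]]
    y = [0, 0, 1, 1]
    model = fit_plda(X, y)
    print(predict_plda(model, [[80, 8, 40], [9, 110, 60]]))
```

The score is linear in the observed counts, which is why the rule is called a linear discriminant. Under the assumptions above, an overdispersed dataset breaks the Poisson mean-variance relationship, which is the motivation for the power-transformed and negative binomial variants compared in the study.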

Highlights

Transcriptome sequencing (RNA-Seq), with the advent of high-throughput next-generation sequencing (NGS) technologies, has become a popular experimental approach for generating a comprehensive catalog of protein-coding genes and non-coding ribonucleic acid (RNA) and for examining the transcriptional activity of genomes.
RNA sequencing (RNA-Seq) is a promising tool with a remarkably wide range of applications, such as (i) discovering novel transcripts, (ii) detecting and quantifying spliced isoforms, (iii) detecting fusions and (iv) revealing sequence variations (e.g., SNPs, indels) [1]. Beyond these common applications, RNA-Seq can be a method of choice for gene-expression-based classification, which identifies significant transcripts, distinguishes biological samples and predicts outcomes from large-scale gene-expression data that can be generated in a single run. This classification is widely used in medicine for diagnostic purposes and refers to the detection of a small subset of genes that achieves the maximal predictive performance.
The figure shows that the cervical and Alzheimer miRNA datasets are very highly overdispersed (φ>1), while the lung and renal cell cancer datasets are substantially overdispersed.
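
As an illustration of what a dispersion value such as φ>1 means, the sketch below computes a per-gene method-of-moments estimate in Python with NumPy. It assumes the usual negative binomial parameterization, variance = μ + φμ², so that φ = (variance − mean) / mean². The source does not state how its dispersion estimates were obtained, so this is only one plausible way to get a rough value. The function name `mom_dispersion` is hypothetical, and library-size normalization is deliberately ignored.

```python
import numpy as np

def mom_dispersion(X):
    """Per-gene method-of-moments dispersion estimate.

    X: (n_samples, n_genes) matrix of counts.
    Assumes variance = mean + phi * mean**2 (negative binomial form).
    Returns NaN for genes with zero mean and clips negative estimates to 0,
    since phi = 0 corresponds to the Poisson case.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    var = X.var(axis=0, ddof=1)          # unbiased sample variance
    phi = np.full(mean.shape, np.nan)
    ok = mean > 0
    phi[ok] = (var[ok] - mean[ok]) / mean[ok] ** 2
    return np.maximum(phi, 0.0)          # NaN entries stay NaN

if __name__ == "__main__":
    counts = [[0, 10, 0], [10, 10, 0], [20, 10, 0]]
    print(mom_dispersion(counts))
```

A gene whose sample variance is no larger than its mean gets φ = 0 here, while values well above zero indicate that a Poisson model would understate the variability.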

Summary

Introduction

Transcriptome sequencing (RNA-Seq), with the advent of high-throughput NGS technologies, has become a popular experimental approach for generating a comprehensive catalog of protein-coding genes and non-coding RNAs and for examining the transcriptional activity of genomes. RNA-Seq is a promising tool with a remarkably wide range of applications, such as (i) discovering novel transcripts, (ii) detecting and quantifying spliced isoforms, (iii) detecting fusions and (iv) revealing sequence variations (e.g., SNPs, indels) [1]. Beyond these common applications, RNA-Seq can be a method of choice for gene-expression-based classification, which identifies significant transcripts, distinguishes biological samples and predicts outcomes from large-scale gene-expression data that can be generated in a single run. This classification is widely used in medicine for diagnostic purposes and refers to the detection of a small subset of genes that achieves the maximal predictive performance. Various studies have been conducted to deal with the overdispersion problem in the differential-expression (DE) analysis of RNA-Seq data [8,9,10,11,12].

