Abstract

Background: One of the main goals of RNA-seq data analysis is the identification of biomarkers that are differentially expressed (DE) across two or more experimental conditions. RNA-seq uses next-generation sequencing technology and has many advantages over microarrays. Numerous statistical methods have been developed for identifying biomarkers from RNA-seq data, most of them based on either the Poisson or the negative binomial distribution. However, these methods identify biomarkers from discrete RNA-seq data inefficiently when the datasets contain outliers or extreme observations. In particular, their performance deteriorates further when outliers occur in data from a small number of samples. In this study, we therefore propose an outlier detection and modification approach for RNA-seq data to overcome these problems. To make the approach applicable to RNA-seq data, we transform the read counts into continuous data.

Methods: We use a median control chart to detect and modify outlying observations in a log-transformed RNA-seq dataset. To investigate the performance of the proposed method in the absence and presence of outliers, we apply five popular biomarker selection methods (edgeR, edgeR_robust, DESeq, DESeq2 and limma) to both simulated and real datasets.

Results: The simulation results strongly suggest that the proposed method improves the performance of these methods in the presence of outliers. The proposed method also detected an additional 18 outlying DE genes from a real mouse RNA-seq dataset that were not detected by traditional methods. KEGG pathway and gene ontology analyses suggest that these genes may be biomarkers, which requires validation in a wet lab.

Conclusions: We recommend applying the proposed method for biomarker identification from other RNA-seq data.

Highlights

  • One of the major objectives of researchers is to identify biomarkers from RNA-Seq data that are differentially expressed (DE) between two or more experimental conditions

  • To evaluate the performance of different biomarker selection methods, we considered the area under the receiver operating characteristic (ROC) curve

  • Biomarker identification under two or more conditions is an important task for elucidating the molecular basis of phenotypic variation
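The area under the ROC curve mentioned in the highlights can be illustrated with a minimal sketch. It assumes a simulated dataset in which the true DE status of each gene is known, and a per-gene score where larger means stronger evidence of differential expression (for example, one minus a p-value). The function name `roc_auc` and the toy inputs are hypothetical and are not taken from the paper.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: per-gene evidence of differential expression (larger = stronger).
    labels: 1 for genes that are truly DE, 0 otherwise (known in simulations).
    The AUC equals the probability that a randomly chosen DE gene scores
    higher than a randomly chosen non-DE gene, with ties counted as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one DE and one non-DE gene")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: two truly DE genes followed by two non-DE genes.
perfect = roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
mixed = roc_auc([0.9, 0.2, 0.8, 0.3], [1, 1, 0, 0])
```

An AUC of 1 means every DE gene is ranked above every non-DE gene, while 0.5 corresponds to a ranking no better than chance. The pairwise loop is quadratic in the number of genes, so a rank-based formulation would be preferable for full-size datasets.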


Summary

Introduction

One of the major objectives of researchers is to identify biomarkers from RNA-seq data that are differentially expressed (DE) between two or more experimental conditions. Outliers may arise in RNA-seq count data because there are several data-generating stages, from the biological harvesting of RNA samples to the counting of mapped sequence reads[13]. To mitigate this issue, many algorithms use transformation methods. Several transformations are available for RNA-seq data: logarithmic transformation[14], variance-stabilizing transformation (vst)[6], TMM transformation[15], regularized logarithm[8] and variance modeling at the observation level (voom)[16]. These methods only pull low-level outliers into a reasonable range during parameter estimation; they fail to reduce the influence of high-level outliers when sample sizes are small. In this study, we propose an outlier detection and modification approach for RNA-seq data to improve the performance of the popular biomarker selection methods in the presence of outliers. A broad simulation study and a real data study are presented in Results and Conclusions.
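As a rough illustration of detecting and modifying outliers on log-transformed counts, the following sketch applies a median-centred control rule to the read counts of one gene within one condition. The specifics are assumptions, not the paper's stated procedure: a log2(count + 1) transform, control limits at the median plus or minus k times a scaled median absolute deviation (MAD), k = 3, and replacement of flagged values by the median. The function name `modify_outliers` is hypothetical.

```python
import math
from statistics import median


def modify_outliers(counts, k=3.0):
    """Flag and replace outlying values for one gene within one condition.

    Assumed rule (illustrative only): values outside
    median +/- k * 1.4826 * MAD on the log2(count + 1) scale are treated
    as outliers and replaced by the median. Returns the modified log
    values and the indices that were flagged.
    """
    logs = [math.log2(c + 1) for c in counts]
    center = median(logs)
    mad = median([abs(x - center) for x in logs])
    sigma = 1.4826 * mad  # makes the MAD comparable to a normal std. dev.
    lower, upper = center - k * sigma, center + k * sigma
    # Note: if MAD is 0, every value different from the median is flagged.
    flagged = [i for i, x in enumerate(logs) if x < lower or x > upper]
    flagged_set = set(flagged)
    modified = [center if i in flagged_set else x for i, x in enumerate(logs)]
    return modified, flagged


# Toy example: six replicates, one with an extreme read count.
counts = [10, 12, 11, 9, 5000, 10]
modified, flagged = modify_outliers(counts)
```

In this toy input only the fifth replicate falls outside the limits, so it alone is replaced, and the remaining values pass through unchanged on the log scale. The median and MAD are used because, unlike the mean and standard deviation, they are barely affected by the outlier they are meant to detect, which matters most when the number of samples is small.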
