Robust principal component analysis for accurate outlier sample detection in RNA-Seq data

Xiaoying Chen, Bo Zhang, Ting Wang, Azad Bonni, Guoyan Zhao

doi:10.1186/s12859-020-03608-0

Abstract

Background: High-throughput RNA sequencing is a powerful approach to studying gene expression. Because data acquisition involves complex multi-step protocols, a sample may deviate extremely from other samples of the same treatment group as a result of technical variation or true biological differences. The high dimensionality of the data, combined with few biological replicates, makes it challenging to detect such samples accurately, and the issue is not well studied in the literature. Robust statistics is a family of theories and techniques that aim to detect outliers by first fitting the majority of the data and then flagging data points that deviate from that fit. Robust statistics have been widely used for outlier detection in multivariate data analysis in chemometrics and engineering. Here we apply robust statistics to RNA-seq data analysis.

Results: We report the use of two robust principal component analysis (rPCA) methods, PcaHubert and PcaGrid, to detect outlier samples in multiple simulated and real biological RNA-seq data sets with positive control outlier samples. PcaGrid achieved 100% sensitivity and 100% specificity in all tests using positive control outliers with varying degrees of divergence. We applied the rPCA methods and classical principal component analysis (cPCA) to an RNA-seq data set profiling gene expression of the external granule layer in the cerebellum of control and conditional SnoN knockout mice. Both rPCA methods detected the same two outlier samples, but cPCA failed to detect any. We performed differentially expressed gene detection before and after outlier removal, as well as with and without batch effect modeling. We validated gene expression changes using quantitative reverse transcription PCR and used the result as a reference to compare the performance of eight different data analysis strategies. Removing outliers without batch effect modeling performed best in terms of detecting biologically relevant differentially expressed genes.

Conclusions: rPCA as implemented in the PcaGrid function is an accurate and objective method for detecting outlier samples. It is well suited to high-dimensional data with small sample sizes, such as RNA-seq data. Outlier removal can significantly improve the performance of differential gene detection and downstream functional analysis.

Highlights

Because data acquisition involves complex multi-step protocols, a sample may deviate extremely from other samples of the same treatment group as a result of technical variation or true biological differences
High-throughput RNA sequencing is a powerful approach to studying gene expression
We simulated an RNA-Seq data set for two treatment groups with 3 biological replicates each using Polyester [26] and used it as the baseline sample set (Fig. 1a)

Summary

Introduction

Because data acquisition involves complex multi-step protocols, a sample may deviate extremely from other samples of the same treatment group as a result of technical variation or true biological differences. We apply robust statistics to RNA-seq data analysis. High-throughput mRNA sequencing, known as RNA-seq [3], has emerged as a powerful approach to transcriptome profiling for detecting differentially expressed genes (DEGs) between two experimental groups. True biological differences or technical failures during sample preparation can cause a sample to deviate extremely from other samples of the same treatment group (biological replicates). We refer to these samples as “outliers”. It has been shown that both “batch effects” and technical “outliers” can be detrimental to the quality of the data and affect downstream analyses [5,6,7].



Journal: BMC Bioinformatics
Publication Date: Jun 29, 2020
Citations: 55
License type: open-access

