Abstract

Expression and methylation profiling are standard genomic techniques, and a growing number of computational methods are being implemented to help analyze the large and complex datasets they generate. These datasets often contain a sizeable fraction of outliers that can cause misleading results in downstream analysis. Here, we present a comprehensive approach for detecting sample and gene outliers in expression or methylation datasets. The core algorithms detected most of the outliers that we introduced artificially. Sample outliers detected by hierarchical clustering are validated by the Silhouette coefficient. At the gene level, the GESD, Boxplot, and MAD algorithms detected the simulated outlier genes with an F-measure of at least 83% when the distributions did not intersect. The combined approach detected many outliers in publicly available datasets from the TCGA and GEO portals. Frequently, functionally similar genes marked as outliers turned out to have outlier observations in the same samples. Because such cases may be of special interest, they are labeled for further investigation. Expression and DNA methylation datasets should be checked for outlier points before proceeding with any further analysis. We suggest that as few as 2 outlier observations suffice to label a gene as an outlier, since they are enough to ruin a perfect co-expression. Outliers might also carry useful information, and thus functionally similar outliers should be labeled for further investigation. The presented software is freely available via GitHub.

Highlights

  • Monitoring gene expression can aid in cancer classification [1] and in identifying clinically relevant tumor subgroups [2]

  • We found that detection by the Generalized Extreme Studentized Deviate (GESD) algorithm was more stable than detection by Boxplot and median absolute deviation (MAD), but it still failed in the last case, which showed strong overlap

  • We presented two modules for outlier detection working at the sample and gene levels


Summary

Introduction

Monitoring gene expression can aid in cancer classification [1] and in identifying clinically relevant tumor subgroups [2]. Profiling of gene expression is a key approach for finding new biomarkers and therapeutic targets for different cancer types [3]. Several data portals, such as the Gene Expression Omnibus (GEO) [4] and The Cancer Genome Atlas (TCGA), provide convenient access to thousands of normalized expression datasets for most cancer types. An outlier might be a gene with abnormal expression values in one or more samples from the same class. It is important to identify outliers in expression datasets and, depending on the type of analysis to be performed, to consider whether these data should be removed [5].
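To make the notion of a gene outlier concrete, the following is a minimal Python sketch of two of the gene-level detectors named in the abstract (MAD and Boxplot), applied to the expression values of a single gene across samples of one class. It is not the authors' implementation: the function names, the modified z-score cutoff of 3.5, and the 1.5 × IQR fences are conventional choices assumed here for illustration. Only the rule of requiring at least 2 outlier observations comes from the abstract. GESD is omitted because it needs critical values from the t-distribution.

```python
from statistics import median, quantiles


def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds `threshold`.

    The 0.6745 scaling and the 3.5 cutoff are conventional defaults,
    assumed here; they are not taken from the paper.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median; nothing can be scored
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


def boxplot_outliers(values, k=1.5):
    """Return indices outside the fences Q1 - k*IQR and Q3 + k*IQR.

    k = 1.5 is the usual boxplot convention (an assumption here).
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def is_outlier_gene(values, detector=mad_outliers, min_observations=2):
    """Label a gene as an outlier if it has at least `min_observations`
    outlier observations (2 is the number suggested in the abstract)."""
    return len(detector(values)) >= min_observations


# Hypothetical expression values for one gene in ten samples of one class.
gene = [5.1, 5.3, 4.9, 5.0, 5.2, 5.1, 12.0, 11.5, 5.0, 4.8]
print(mad_outliers(gene), boxplot_outliers(gene), is_outlier_gene(gene))
```

With these made-up values, both detectors flag the two samples with values near 12, so the gene is labeled as an outlier gene. A gene with only one extreme observation would not be labeled under the 2-observation rule.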

