DupChecker: a bioconductor package for checking high-throughput genomic data redundancy in meta-analysis.

Quanhu Sheng, Yu Shyr, Xi Chen

doi:10.1186/1471-2105-15-323

Abstract

Background: Meta-analysis has become a popular approach for high-throughput genomic data analysis because it can often significantly increase the power to detect biological signals or patterns in datasets. However, when publicly available databases are used for meta-analysis, duplication of samples is a frequently encountered problem, especially for gene expression data. Failing to remove duplicates can lead to false-positive findings, misleading clustering patterns, or model over-fitting in the subsequent data analysis.

Results: We developed a Bioconductor package, DupChecker, that efficiently identifies duplicated samples by generating MD5 fingerprints for raw data. A real data example demonstrates the usage and output of the package.

Conclusions: Researchers may not pay enough attention to checking for and removing duplicated samples, and the resulting data contamination could make the results or conclusions of a meta-analysis questionable. We suggest applying DupChecker to examine all gene expression data sets before any data analysis step.

Electronic supplementary material: The online version of this article (doi:10.1186/1471-2105-15-323) contains supplementary material, which is available to authorized users.

Highlights

Meta-analysis has become a popular approach for high-throughput genomic data analysis because it often can significantly increase power to detect biological signals or patterns in datasets
We developed a Bioconductor package, DupChecker, that can efficiently check sample redundancy based on the raw data files of high-throughput genomic data
For users’ convenience, we developed the functions geoDownload and arrayExpressDownload to download multiple gene expression data sets from Gene Expression Omnibus (GEO) or ArrayExpress databases and deposit the files under the specified directory
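
The fingerprinting idea described above can be illustrated with a minimal Python sketch. It is not DupChecker's code or API (DupChecker is an R/Bioconductor package); the function names `md5_fingerprint` and `find_duplicates`, the `*.CEL` file pattern, and the one-subdirectory-per-data-set layout are assumptions made for illustration only. The sketch computes an MD5 digest for every raw data file under a directory and reports groups of files that share a digest.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_fingerprint(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, pattern="*.CEL"):
    """Group files under `root` by MD5 digest; keep only digests shared by 2+ files.

    `pattern` is a hypothetical default; use whatever extension the raw files have.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob(pattern)):
        groups[md5_fingerprint(path)].append(path)
    return {md5: paths for md5, paths in groups.items() if len(paths) > 1}
```

Calling `find_duplicates("downloads")` on a hypothetical directory of downloaded data sets would return a dictionary mapping each shared digest to the list of files carrying it. Because the digest is computed from the file bytes, only byte-identical raw files are grouped, which matches the limitation noted in the summary: the same specimen profiled twice produces different raw files and is not flagged.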

Summary

Conclusions

Gene expression meta-analysis has become increasingly popular for high-throughput genomic data analysis. Because of the large amount of publicly available gene expression data contributed by different researchers, it is almost inevitable that duplicated samples will be included in the data sets collected for meta-analysis. Specimens or RNA samples profiled twice, whether on the same platform or on different platforms, will not be identified by DupChecker. In this application note, we illustrated the application using gene expression data, but the DupChecker package can be applied to other types of high-throughput genomic data, including next-generation sequencing data.

Additional file 1: The full result table generated by DupChecker for the colon cancer data.
Additional file 2: The full result table generated by DupChecker for the breast cancer data.

All authors read and approved the final manuscript.

Journal: BMC Bioinformatics
Publication Date: Sep 30, 2014
Citations: 10
License type: cc-by

Similar Papers

High-Dimensional Sparse Additive Hazards Regression
Wei Lin ... Jinchi Lv
Journal of the American Statistical Association | VOL. 108
26 Dec 2012

5 - Analysis of high-throughput data
Vladimir I Razinkov ... Gerd R Kleemann
High-Throughput Formulation Development of Biopharmaceuticals | VOL. -
14 Oct 2016

Clipper: p-value-free FDR control on high-throughput data from two conditions
Xinzhou Ge ... Kyla Woyshner
Genome Biology | VOL. 22
11 Oct 2021

Generalized shrinkage F-like statistics for testing an interaction term in gene expression analysis in the presence of heteroscedasticity.
Jie Yang ... George Casella
BMC bioinformatics | VOL. 12
01 Nov 2011
