Abstract

As the number of genomics datasets grows rapidly, sample mislabeling has become a high-stakes issue. We present CrosscheckFingerprints (Crosscheck), a tool for quantifying sample relatedness and detecting incorrectly paired sequencing datasets from different donors. Crosscheck outperforms similar methods and is effective even when data are sparse or come from different assays. Applying Crosscheck to 8851 ENCODE ChIP-, RNA-, and DNase-seq datasets enabled us to identify and correct dozens of mislabeled samples and ambiguous metadata annotations, representing ~1% of ENCODE datasets.

Highlights

  • As the number of genomics datasets grows rapidly, sample mislabeling has become a high-stakes issue

  • The common logic behind these tools is that each genome harbors a unique set of single-nucleotide polymorphisms (SNPs) that are shared between datasets originating from the same donor

  • We set out to develop a method for quantifying sample relatedness that was both robust to shallow sequencing depth and that could be systematically applied to modern large-scale projects incorporating multiple data types
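
The SNP-sharing logic in the highlights above can be illustrated with a toy log-odds (LOD) comparison of two genotype-call vectors. This is a minimal sketch and not Crosscheck's actual model: the function names, the Hardy-Weinberg prior, the symmetric error model, and the default `error_rate` are all assumptions made for illustration.

```python
import math


def genotype_priors(alt_freq):
    """Hardy-Weinberg genotype frequencies (0, 1, or 2 alt alleles)."""
    ref_freq = 1.0 - alt_freq
    return [ref_freq ** 2, 2 * alt_freq * ref_freq, alt_freq ** 2]


def p_call(call, true_genotype, error_rate):
    """Probability of observing `call` given the true genotype (assumed error model)."""
    return 1.0 - error_rate if call == true_genotype else error_rate / 2.0


def lod_same_donor(calls_a, calls_b, alt_freqs, error_rate=0.01):
    """Toy log10 odds that two call vectors come from the same donor.

    calls_a, calls_b: genotype calls per SNP (0, 1, 2, or None if uncovered).
    alt_freqs: population alt-allele frequency per SNP, strictly between 0 and 1.
    """
    lod = 0.0
    for call_a, call_b, alt_freq in zip(calls_a, calls_b, alt_freqs):
        if call_a is None or call_b is None:
            continue  # a SNP missing from either dataset carries no evidence
        priors = genotype_priors(alt_freq)
        # Same donor: both calls are noisy views of one shared genotype.
        p_same = sum(
            priors[g] * p_call(call_a, g, error_rate) * p_call(call_b, g, error_rate)
            for g in range(3)
        )
        # Different donors: the two genotypes are drawn independently.
        p_a = sum(priors[g] * p_call(call_a, g, error_rate) for g in range(3))
        p_b = sum(priors[g] * p_call(call_b, g, error_rate) for g in range(3))
        lod += math.log10(p_same / (p_a * p_b))
    return lod


# Hypothetical calls at five SNPs, all with alt-allele frequency 0.5.
freqs = [0.5] * 5
matching = lod_same_donor([0, 1, 2, 1, 0], [0, 1, 2, 1, 0], freqs)
discordant = lod_same_donor([0, 0, 2, 2, 0], [2, 2, 0, 0, 2], freqs)
```

In this sketch, concordant calls push the score above zero and discordant calls push it below zero, while SNPs that are uncovered in either dataset are skipped, which is one simple way to tolerate sparse data.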

Summary

Results

We compared each flagged mismatch to all other flagged mismatches to identify genetically consistent clusters and uncover patterns of mislabeling. This analysis uncovered three major categories of mislabeling (as well as a small fraction, 0.4%, of datasets that exhibited a pattern consistent with cross-sample contamination, as described in Methods and Supplementary Fig. 3). Of four flagged datasets labeled as K562, two were shown to derive from GM12878 cells, while the other two derived from HEK293 cells. This type of mislabeling may occur for primary cells or tissues when many biological samples from multiple donors are obtained from the same source, as in the case of 300 embryonic tissue samples processed by ENCODE from a single lab. We suggest Crosscheck as a critical component of any NGS quality control pipeline.
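
Grouping flagged datasets into genetically consistent clusters can be sketched as finding connected components over pairwise relatedness scores. The sketch below is illustrative only: the function name `cluster_by_lod`, the threshold of 5.0, the dataset names, and the scores are hypothetical and are not taken from the paper.

```python
def cluster_by_lod(names, pair_lods, threshold=5.0):
    """Group datasets into connected components linked by LOD > threshold.

    names: dataset identifiers.
    pair_lods: dict mapping (name_a, name_b) to a pairwise relatedness score.
    threshold: assumed cutoff above which two datasets count as same-donor.
    """
    parent = {name: name for name in names}

    def find(name):
        # Walk to the component root, compressing the path along the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (name_a, name_b), lod in pair_lods.items():
        if lod > threshold:
            parent[find(name_a)] = find(name_b)  # merge the two components

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical scores for four flagged datasets and two reference datasets.
names = ["flagged_1", "flagged_2", "flagged_3", "flagged_4", "ref_A", "ref_B"]
pair_lods = {
    ("flagged_1", "ref_A"): 40.0,
    ("flagged_2", "ref_A"): 35.0,
    ("flagged_3", "ref_B"): 50.0,
    ("flagged_4", "ref_B"): 42.0,
    ("flagged_1", "flagged_3"): -30.0,
}
clusters = cluster_by_lod(names, pair_lods)
# clusters == [["flagged_1", "flagged_2", "ref_A"], ["flagged_3", "flagged_4", "ref_B"]]
```

Connected components are the simplest choice here; a stricter variant could require every pair within a cluster to exceed the threshold.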

Methods
Code availability