Abstract
BackgroundExogenous cDNA introduced into an experimental system, either intentionally or accidentally, can appear as added read coverage over that gene in next-generation sequencing libraries derived from this system. If not properly recognized and managed, this cross-contamination with exogenous signal can lead to incorrect interpretation of research results. Yet, this problem is not routinely addressed in current sequence processing pipelines.ResultsWe present cDNA-detector, a computational tool to identify and remove exogenous cDNA contamination in DNA sequencing experiments. We demonstrate that cDNA-detector can identify cDNAs quickly and accurately from alignment files. A source inference step attempts to separate endogenous cDNAs (retrocopied genes) from potential cloned, exogenous cDNAs. cDNA-detector provides a mechanism to decontaminate the alignment from detected cDNAs. Simulation studies show that cDNA-detector is highly sensitive and specific, outperforming existing tools. We apply cDNA-detector to several highly-cited public databases (TCGA, ENCODE, NCBI SRA) and show that contaminant genes appear in sequencing experiments where they lead to incorrect coverage peak calls.ConclusionscDNA-detector is a user-friendly and accurate tool to detect and remove cDNA detection in NGS libraries. This two-step design reduces the risk of true variant removal since it allows for manual review of candidates. We find that contamination with intentionally and accidentally introduced cDNAs is an underappreciated problem even in widely-used consortium datasets, where it can lead to spurious results. Our findings highlight the importance of sensitive detection and removal of contaminant cDNA from NGS libraries before downstream analysis.
Highlights
Exogenous cDNA introduced into an experimental system, either intentionally or accidentally, can appear as added read coverage over that gene in nextgeneration sequencing libraries derived from this system
Conclusions cDNA-detector is a highly sensitive and accurate method to identify cDNA inserts in Next-generation sequencing (NGS) libraries and remove contaminant reads if desired. cDNA-detector does not remove vector backbone sequences unmapped to the reference genome; other methods exist for this purpose (e.g. DeconSeq, SeqtrimNext)
Custom amplicons used in the laboratory or sequencing facility should be added to the gene model to optimize detection sensitivity
Summary
Exogenous cDNA introduced into an experimental system, either intentionally or accidentally, can appear as added read coverage over that gene in nextgeneration sequencing libraries derived from this system. Trace amounts of vector DNA from other experiments performed in the same laboratory or on shared equipment can contaminate the DNA library, leading to the presence of foreign genes in the derived alignment. Signal from these contaminant genes can affect downstream results and interpretation, for example through spurious increased copy number [1, 2], false germline or somatic variant calls [3, 4], or incorrect coverage peak calls [3,4,5]
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.