Abstract

Background: Contaminant DNA is a well-known confounding factor in molecular biology and in genomic repositories. Strikingly, analysis workflows for whole-genome sequencing (WGS) data commonly do not account for errors potentially introduced by contamination, which could lead to incorrect assessment of allele frequency in both basic and clinical research.

Results: We used a taxonomic filter to remove contaminant reads from more than 4000 bacterial samples from 20 different studies and performed a comprehensive evaluation of the extent and impact of contaminant DNA in WGS. We found that contamination is pervasive and can introduce large biases in variant analysis. We showed that these biases can result in hundreds of false positive and false negative SNPs, even for samples with slight contamination. Studies investigating complex biological traits from sequencing data can be completely biased if contamination is neglected during the bioinformatic analysis, and we demonstrate that removing contaminant reads with a taxonomic classifier permits more accurate variant calling. We used both real and simulated data to evaluate and implement reliable, contamination-aware analysis pipelines.

Conclusion: As sequencing technologies consolidate as precision tools that are increasingly adopted in research and clinical contexts, our results call for the implementation of contamination-aware analysis pipelines. Taxonomic classifiers are a powerful tool for implementing such pipelines.

Highlights

  • Contaminant DNA is a well-known confounding factor in molecular biology and in genomic repositories

  • We demonstrate that removing contaminant reads with a taxonomic classifier allows the implementation of more accurate variant calling pipelines, and provide a validated workflow for whole-genome sequencing (WGS) analysis of Mycobacterium tuberculosis (MTB)

  • Contamination is common across WGS studies, even when sequencing from pure cultures. To assess the extent of contamination across bacterial WGS studies, we taxonomically classified the sequencing reads of 4194 WGS samples from 20 different studies using Kraken, a metagenomic read classifier that has been extensively used and evaluated in the literature
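
The classification step described in the last highlight can be turned into a read filter along the following lines. This is a minimal sketch, not the authors' pipeline: it assumes Kraken's standard per-read output (tab-separated status, read ID, assigned taxid, read length, LCA mapping), and the function names, the `keep_unclassified` switch, and the choice of taxids to keep are hypothetical. The set of taxids to keep must already include every descendant taxon of interest, because the sketch does not walk the taxonomy tree.

```python
from typing import Iterable, Set, Tuple


def split_reads_by_taxon(
    kraken_lines: Iterable[str],
    keep_taxids: Set[int],
    keep_unclassified: bool = False,  # hypothetical policy switch
) -> Tuple[Set[str], Set[str]]:
    """Partition read IDs into (kept, removed) from Kraken per-read output.

    Expected columns (tab-separated): status (C or U), read ID,
    assigned taxid, read length, LCA mapping.
    """
    kept: Set[str] = set()
    removed: Set[str] = set()
    for line in kraken_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        status, read_id, taxid = fields[0], fields[1], int(fields[2])
        if status == "C":
            # Classified read: keep only if its taxid is in the allowed set.
            target = kept if taxid in keep_taxids else removed
        else:
            # Unclassified read: keep or drop according to the chosen policy.
            target = kept if keep_unclassified else removed
        target.add(read_id)
    return kept, removed


def contamination_fraction(kept: Set[str], removed: Set[str]) -> float:
    """Fraction of reads that did not pass the taxonomic filter."""
    total = len(kept) + len(removed)
    return len(removed) / total if total else 0.0
```

The kept read IDs can then be used to subset the FASTQ files before mapping and variant calling. Whether unclassified reads are discarded or retained is a design choice that trades sensitivity against the risk of keeping contaminant reads, so it is exposed as a parameter rather than fixed.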


Summary

Introduction

Contaminant DNA is a well-known confounding factor in molecular biology and in genomic repositories. Analysis workflows for whole-genome sequencing (WGS) data commonly do not account for errors potentially introduced by contamination, which could lead to incorrect assessment of allele frequency in both basic and clinical research. Many efforts in basic and clinical research are directed toward improving bioinformatic pipelines to ensure the robustness of the conclusions drawn. Central to many bacterial WGS bioinformatic pipelines is the identification of genetic variants. While many factors are taken into account when developing SNP calling pipelines, the potential role of contamination is, surprisingly, seldom considered [13]. Misinterpretation of contaminated data can lead to incorrect conclusions about biological phenomena [14, 15].
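
To make the effect on allele frequency concrete, the sketch below uses a simple linear mixing model. This model is an assumption introduced here for illustration, not one taken from the paper: the observed alternative-allele frequency at a site is treated as a weighted average of the frequency in the true sample and in the contaminant reads that map to that site. The two calling thresholds and the 15% contamination level are hypothetical values.

```python
def observed_alt_frequency(
    true_freq: float, contaminant_freq: float, contamination: float
) -> float:
    """Expected alt-allele frequency when a fraction of the reads at a site
    comes from a contaminant (assumed linear mixing model)."""
    if not 0.0 <= contamination <= 1.0:
        raise ValueError("contamination must be between 0 and 1")
    return (1.0 - contamination) * true_freq + contamination * contaminant_freq


# Hypothetical thresholds, chosen only for illustration.
FIXED_CALL_THRESHOLD = 0.90      # minimum frequency to call a fixed SNP
LOW_FREQ_CALL_THRESHOLD = 0.05   # minimum frequency to call a minority variant

# Case 1: the sample carries a fixed SNP, the contaminant reads match the reference.
missed = observed_alt_frequency(1.0, 0.0, contamination=0.15)
print("fixed SNP still called:", missed >= FIXED_CALL_THRESHOLD)

# Case 2: the sample matches the reference, the contaminant reads carry a mismatch.
spurious = observed_alt_frequency(0.0, 1.0, contamination=0.15)
print("spurious minority variant called:", spurious >= LOW_FREQ_CALL_THRESHOLD)
```

Under these assumed numbers, the true fixed SNP drops below the fixed-call threshold (a false negative), while the contaminant-derived mismatch rises above the minority-variant threshold (a false positive).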

