Abstract

Modern genome sequencing strategies are highly sensitive to contamination, making the detection of foreign DNA sequences an important part of analysis pipelines. Here we use Taxoblast, a simple pipeline with a graphical user interface, for the post-assembly detection of contaminating sequences in the published genome of the kelp Saccharina japonica. Analyses were based on multiple blastn searches with short sequence fragments. They revealed a number of probable bacterial contaminations as well as hybrid scaffolds that contain both bacterial and algal sequences. This or similar types of analysis, in combination with manual curation, may thus constitute a useful complement to standard bioinformatics analyses prior to submission of genomic data to public repositories. Our analysis pipeline is open-source and freely available at http://sdittami.altervista.org/taxoblast and via SourceForge (https://sourceforge.net/projects/taxoblast).

Highlights

  • Modern genome sequencing strategies rely strongly on the amplification of low quantities of deoxyribonucleic acid (DNA), making them highly sensitive to even small contaminations in the samples

  • In our case, the approach of splitting scaffolds into small fragments prior to blast searches resulted in the identification of several hybrid scaffolds and of approximately 8 Mbp of contaminant bacterial sequences that had previously been missed

  • Scaffolds >2 kb were considered for our analyses, and blastn searches were carried out against the National Center for Biotechnology Information (NCBI) nucleotide database with an e-value cutoff of 0.01
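
The fragment-based approach summarized in the highlights can be illustrated with a minimal Python sketch. It is not the Taxoblast implementation: the fragment length (500 bp), the function names, the label vocabulary ("bacterial", "algal"), and the `min_share` threshold are assumptions introduced here for illustration only. The 2 kb scaffold cutoff is the only value taken from the text.

```python
from collections import Counter
from typing import Dict, Iterator, List, Tuple

MIN_SCAFFOLD_LEN = 2000  # from the text: only scaffolds >2 kb were analysed
FRAGMENT_LEN = 500       # assumption: the fragment size is not stated in this section


def split_scaffold(seq: str, fragment_len: int = FRAGMENT_LEN) -> Iterator[Tuple[int, str]]:
    """Yield (start_offset, fragment) pairs that tile the scaffold without overlap."""
    for start in range(0, len(seq), fragment_len):
        yield start, seq[start:start + fragment_len]


def fragments_for_blast(scaffolds: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (fragment_id, sequence) records for every scaffold longer than 2 kb."""
    records = []
    for name, seq in scaffolds.items():
        if len(seq) <= MIN_SCAFFOLD_LEN:
            continue  # scaffolds of 2 kb or less are skipped
        for start, fragment in split_scaffold(seq):
            records.append((f"{name}:{start}", fragment))
    return records


def classify_scaffold(labels: List[str], min_share: float = 0.2) -> str:
    """Summarise per-fragment labels as 'bacterial', 'algal', 'hybrid' or 'unassigned'.

    `labels` holds one taxonomic label per fragment, e.g. derived from the best
    blastn hit. `min_share` is a hypothetical threshold: a scaffold is called
    hybrid when both groups account for more than this share of labelled fragments.
    """
    counts = Counter(label for label in labels if label in ("bacterial", "algal"))
    total = sum(counts.values())
    if total == 0:
        return "unassigned"
    bacterial_share = counts["bacterial"] / total
    if bacterial_share >= 1 - min_share:
        return "bacterial"
    if bacterial_share <= min_share:
        return "algal"
    return "hybrid"
```

In such a workflow, the fragments would be written to a FASTA file and searched with blastn using the stated cutoff (`-evalue 0.01`). The per-fragment hits would then be mapped to taxonomic labels before calling `classify_scaffold`. Scaffolds flagged as hybrid or bacterial are candidates for the manual curation recommended in the abstract, not final calls.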

Introduction

Modern genome sequencing strategies rely strongly on the amplification of low quantities of deoxyribonucleic acid (DNA), making them highly sensitive to even small contaminations in the samples. A study by Longo, O’Neill & O’Neill (2011), for example, showed that almost a quarter of the non-primate genomes available in the National Center for Biotechnology Information (NCBI) databases are contaminated by repeated elements frequently found in human cells. Samples may be contaminated by airborne bacteria or other eukaryotes, ingested food, or symbionts living within or attached to the target organism. The detection of contaminants in genome datasets may be accomplished pre-assembly, post-assembly, or using a combination of both approaches. Pre-assembly removal of potential contaminants has the advantage of reducing the complexity of the assembly process by producing smaller and more homogeneous data sets. A first step may be filtering according to k-mer coverage or according to per-read guanine-cytosine (GC) contents (Schmieder & Edwards, 2011) or applying more advanced binning techniques based on
