Abstract

Background: High-throughput sequencing has made it theoretically possible to obtain high-quality de novo assembled genome sequences but in practice DNA extracts are often contaminated with sequences from other organisms. Currently, there are few existing methods for rigorously decontaminating eukaryotic assemblies. Those that do exist filter sequences based on nucleotide similarity to contaminants and risk eliminating sequences from the target organism.

Results: We introduce a novel application of an established machine learning method, a decision tree, that can rigorously classify sequences. The major strength of the decision tree is that it can take any measured feature as input and does not require a priori identification of significant descriptors. We use the decision tree to classify de novo assembled sequences and compare the method to published protocols.

Conclusions: A decision tree performs better than existing methods when classifying sequences in eukaryotic de novo assemblies. It is efficient, readily implemented, and accurately identifies target and contaminant sequences. Importantly, a decision tree can be used to classify sequences according to measured descriptors and has potentially many uses in distilling biological datasets.
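As a rough illustration of the idea of classifying assembled sequences from measured features, the following is a minimal sketch of a decision tree grown with Gini impurity. It is not the authors' implementation. The two features used here (GC content and read coverage), the toy values, and all function names are assumptions made for this example; the source only states that any measured feature can serve as input.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature_index, threshold) that most reduces weighted
    Gini impurity, or None if no split improves on the unsplit node."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow a tree recursively. Internal nodes are tuples
    (feature_index, threshold, left_subtree, right_subtree);
    leaves are the majority class label (a string)."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return (
        f,
        t,
        build_tree([rows[i] for i in left], [labels[i] for i in left],
                   depth + 1, max_depth),
        build_tree([rows[i] for i in right], [labels[i] for i in right],
                   depth + 1, max_depth),
    )

def classify(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical training contigs: (GC content, mean read coverage).
# These values are invented for illustration only.
train_rows = [
    (0.35, 48.0), (0.37, 52.0), (0.36, 55.0), (0.34, 47.0),  # target-like
    (0.66, 6.0), (0.64, 4.0), (0.67, 5.0), (0.63, 7.0),      # contaminant-like
]
train_labels = ["target"] * 4 + ["contaminant"] * 4

tree = build_tree(train_rows, train_labels)
```

Calling `classify(tree, (0.36, 50.0))` on an unseen contig then returns a class label. The tree chooses its own splitting features and thresholds from the training data, which mirrors the abstract's point that significant descriptors need not be identified in advance.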

Highlights

  • High-throughput sequencing has made it theoretically possible to obtain high-quality de novo assembled genome sequences but in practice DNA extracts are often contaminated with sequences from other organisms

  • We included A. vaga to determine whether a decision tree could accurately separate foreign DNA from horizontally transferred DNA in an organism with high levels of confirmed horizontal gene transfer (HGT) [35,36,37]

  • To test the methods on a range of genome-contaminant data structures, we simulated genomic and transcriptomic libraries from the published gene sequences of the plant Arabidopsis thaliana, the nematode C. elegans, the fruit fly D. melanogaster, and the pufferfish Takifugu rubripes

Summary

Introduction

High-throughput sequencing has made it theoretically possible to obtain high-quality de novo assembled genome sequences, but in practice DNA extracts are often contaminated with sequences from other organisms. There are few existing methods for rigorously decontaminating eukaryotic assemblies. Those that do exist filter sequences based on nucleotide similarity to contaminants and risk eliminating sequences from the target organism. Most organisms do not live in sterile environments, and extracted DNA may be contaminated with foreign DNA from associated microbiota [1,2,3] and endosymbionts [4]. Contaminants are sequenced and assembled along with the DNA of the target organism and, if not eliminated, become part of the assembled genome sequence. Crisp et al. [11] analyzed horizontal gene transfer (HGT) in 40 metazoan genomes but excluded 9 from the HGT analyses because of extensive contamination.
