Abstract

Generating the raw data for a de novo genome assembly project for a target eukaryotic species is relatively easy. This democratization of access to large-scale data has allowed many research teams to plan to assemble the genomes of non-model organisms. These new genome targets are very different from the traditional, inbred, laboratory-reared model organisms. They are often small, and cannot be isolated free of their environment – whether ingested food, the surrounding host organism of parasites, or commensal and symbiotic organisms attached to or within the individuals sampled. Preparation of pure DNA originating from a single species can be technically impossible, but assembly of mixed-organism DNA can be difficult, as most genome assemblers perform poorly when faced with multiple genomes in different stoichiometries. This class of problem is common in metagenomic datasets that deliberately try to capture all the genomes present in an environment, but replicon assembly is not often the goal of such programs. Here we present an approach to extracting, from mixed DNA sequence data, subsets that correspond to single species’ genomes and thus improving genome assembly. We use both numerical (proportion of GC bases and read coverage) and biological (best-matching sequence in annotated databases) indicators to aid partitioning of draft assembly contigs, and the reads that contribute to those contigs, into distinct bins that can then be subjected to rigorous, optimized assembly, through the use of taxon-annotated GC-coverage plots (TAGC plots). We also present Blobsplorer, a tool that aids exploration and selection of subsets from TAGC-annotated data. Partitioning the data in this way can rescue poorly assembled genomes, and reveal unexpected symbionts and commensals in eukaryotic genome projects. The TAGC plot pipeline script is available from https://github.com/blaxterlab/blobology, and the Blobsplorer tool from https://github.com/mojones/Blobsplorer.
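The partitioning idea described above can be illustrated with a minimal sketch. This is not the blobology pipeline itself. It assumes each draft contig already carries a mean read coverage and a best-match taxon label, and every name below (`Contig`, `gc_proportion`, `partition_contigs`, the `"no-hit"` label, the thresholds and the toy sequences) is hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Contig(NamedTuple):
    """Hypothetical record for one draft-assembly contig."""
    name: str
    sequence: str
    coverage: float  # mean read depth, assumed to be computed elsewhere
    taxon: str       # best-match taxon label, or "no-hit" (assumed convention)


def gc_proportion(sequence: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def partition_contigs(contigs, target_taxon, gc_range, min_coverage):
    """Assign contig names to 'target', 'other' or 'unassigned' bins.

    Contigs annotated with the target taxon go to 'target'. Unannotated
    contigs join them only if they fall inside the GC window and meet the
    coverage threshold; otherwise they stay 'unassigned'.
    """
    low, high = gc_range
    bins = defaultdict(list)
    for contig in contigs:
        gc = gc_proportion(contig.sequence)
        in_window = low <= gc <= high and contig.coverage >= min_coverage
        if contig.taxon == target_taxon:
            bins["target"].append(contig.name)
        elif contig.taxon == "no-hit":
            bins["target" if in_window else "unassigned"].append(contig.name)
        else:
            bins["other"].append(contig.name)
    return dict(bins)


# Toy data: sequences, coverages and taxon labels are invented.
contigs = [
    Contig("c1", "ATATATGCAT", 50.0, "Nematoda"),
    Contig("c2", "GCGCGCGCAT", 300.0, "Proteobacteria"),
    Contig("c3", "ATGCATATAT", 45.0, "no-hit"),
    Contig("c4", "GGCCGGCCAT", 5.0, "no-hit"),
]
bins = partition_contigs(contigs, "Nematoda", gc_range=(0.1, 0.4), min_coverage=20.0)
```

In this toy case the unannotated contig `c3` is grouped with the target because its GC proportion and coverage resemble those of the annotated target contig, while `c4` remains unassigned. In practice the GC window and coverage threshold would be chosen by inspecting the TAGC plot rather than fixed in advance, and the reads contributing to each bin would then be extracted for reassembly.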

Highlights

  • The raw power of new sequencing methods has permitted the expansion of genome science into a wide range of new biological systems

  • The reduced complexity dataset is easier to screen, partly because of the smaller number of analytic steps needed, and because the longer sequences are a better substrate for assessment of numerical (GC proportion, coverage) and biological metrics

  • There is no need to extensively scaffold the assembly, and we have used mate-pair data given to the assembler as “single-end” for TAGC plot analyses in the D. immitis example

Introduction

The raw power of new sequencing methods has permitted the expansion of genome science into a wide range of new biological systems. In particular, the technologies permit genome sampling from wild organisms and communities of organisms. This approach was unthinkable in the era of Sanger-sequenced genomes, as the per-base cost precluded deep sampling of mixed starting materials in order to assemble the genome or transcriptome of a particular target organism. Even free-living nematodes, feeding on bacteria or fungi, can come with attached or ingested food, as difficult-to-remove biofilms, or sequestered in the animals’ intestines. These mixed samples are akin to low-complexity metagenomes, where a metagenome samples all the replicons present in an ecological sample. We have frequently observed DNA samples that are “contaminated” with the genomes of other species: components of food, commensal organisms, parasites and pathogens, or laboratory contaminants. It is common to observe bacterial genomic contamination of eukaryotic samples.
