Abstract

High-throughput sequencing provides a fast and cost-effective means to recover genomes of organisms from all domains of life. However, adequate curation of the assembly results against potential contamination by non-target organisms requires advanced bioinformatics approaches and practices. Here, we re-analyzed the sequencing data generated for the tardigrade Hypsibius dujardini, and created a holistic display of the eukaryotic genome assembly using DNA data originating from two groups and eleven sequencing libraries. By using bacterial single-copy genes, k-mer frequencies, and coverage values of scaffolds, we could identify and characterize multiple near-complete bacterial genomes from the raw assembly, and curate a 182 Mbp draft genome for H. dujardini supported by RNA-Seq data. Our results indicate that most contaminant scaffolds were assembled from Moleculo long-read libraries, and that most of these contaminants differed between library preparations. Our re-analysis shows that visualization and curation of eukaryotic genome assemblies can benefit from tools designed to address the needs of today’s microbiologists, who are constantly challenged by the difficulties associated with the identification of distinct microbial genomes in complex environmental metagenomes.

Highlights

  • The URL http://merenlab.org/data/ reports (1) anvi’o files to regenerate Figs. 1 and 2, (2) our curation of the tardigrade genome from Boothby et al.’s assembly, and (3) the FASTA files for bacterial genomes we identified in the raw assemblies from Boothby et al. and Koutsovoulos et al.

Introduction

Advances in high-throughput sequencing technologies are revolutionizing the field of genomics by allowing researchers to generate large amounts of data in a short period of time (Loman & Pallen, 2015). Microbiologists often exploit two essential properties of bacterial and archaeal genomes to improve the “binning” step: (1) k-mer frequencies that are somewhat preserved throughout a single microbial genome (Pride et al, 2003), which help identify contigs that likely originate from the same genome (Teeling et al, 2004), and (2) a set of genes that occur in the vast majority of bacterial genomes as a single copy, which help estimate the level of completion and contamination of genome bins (Wu & Eisen, 2008; Campbell et al, 2013; Parks et al, 2015). These properties, along with differential coverage of contigs across multiple samples when such data exist, are routinely used to identify coherent microbial draft genomes in metagenomic assemblies (Dick et al, 2009; Albertsen et al, 2013; Wu et al, 2014; Alneberg et al, 2014; Kang et al, 2015; Eren et al, 2015).
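
The two properties described above can be illustrated with a minimal Python sketch. It is not the implementation used by anvi’o or by any of the cited tools. The function names, the forward-strand-only tetranucleotide counting, the Euclidean distance, and the marker gene names in the usage note are all assumptions made for illustration. The first function turns a contig into a normalized k-mer frequency profile, so that contigs with similar profiles can be grouped together. The third function uses a list of single-copy gene hits in a bin to estimate completion (share of expected genes seen at least once) and redundancy (extra copies, a proxy for contamination).

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies of a DNA sequence.

    Assumption: only the forward strand is counted, and windows that
    contain characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


def profile_distance(p, q):
    """Euclidean distance between two k-mer profiles (smaller = more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def completion_and_redundancy(hits, marker_genes):
    """Estimate completion and redundancy (both in percent) of a genome bin.

    hits: names of single-copy genes detected in the bin, one entry per hit.
    marker_genes: the set of single-copy genes expected in a complete genome.
    """
    counts = Counter(h for h in hits if h in marker_genes)
    completion = 100.0 * len(counts) / len(marker_genes)
    extra_copies = sum(c - 1 for c in counts.values())
    redundancy = 100.0 * extra_copies / len(marker_genes)
    return completion, redundancy
```

As a usage example with placeholder gene names, `completion_and_redundancy(["rpoB", "rpoB", "recA"], {"rpoB", "recA", "gyrB", "dnaK"})` reports a bin in which half of the expected genes are present and one of them occurs twice. Real marker collections contain many more genes, and production tools typically also account for both strands and for contig length when comparing k-mer profiles.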
