Abstract

Background: Advances in sequencing, assembly, and assortment of contigs into species-specific bins have enabled the reconstruction of genomes from metagenomic data (MAGs). Though this is a powerful technique, it is difficult to determine whether assembly and binning techniques are accurate when applied to environmental metagenomes, because complete reference genome sequences against which to check the resulting MAGs are lacking.

Methods: We compared MAGs derived from an enrichment culture containing ~20 organisms to complete genome sequences of 10 organisms isolated from the enrichment culture. Factors commonly considered in binning software (nucleotide composition and sequence repetitiveness) were calculated for both the correctly binned and not-binned regions. This direct comparison revealed biases in sequence characteristics and gene content in the not-binned regions. Additionally, the composition of three public data sets representing MAGs reconstructed from the Tara Oceans metagenomic data was compared to a set of representative genomes available through NCBI RefSeq, to verify that the biases identified were observable in more complex data sets and with three contemporary binning software packages.

Results: Repeat sequences were frequently not binned in the genome reconstruction processes, as were sequence regions with variant nucleotide composition. Genes encoded on the not-binned regions were strongly biased towards ribosomal RNAs, transfer RNAs, mobile element functions and genes of unknown function. Our results support genome reconstruction as a robust process and suggest that reconstructions determined to be >90% complete are likely to effectively represent organismal function; however, population-level genotypic heterogeneity in natural populations, such as uneven distribution of plasmids, can lead to incorrect inferences.

Highlights

  • High-throughput sequencing has revolutionized microbiology by allowing investigation of natural communities in a culture-independent manner (Long et al 2016; White et al 2016b; Zhou et al 2015)

  • Two common elements of current sequence segregation protocols are analysis of sequence composition and comparison of coverage profiles between samples, so we examined the nucleotide content of MDRs vs CDRs as both %G+C and tetranucleotide content, and the multiplicity of sequence information both within the individual genome and across the entire metagenomic data set

  • The genomic dataset used in this study was collected from two unicyanobacterial consortia and ten organisms isolated therefrom, and consisted of draft or full genome sequence for the ten isolates and cognate genome reconstructions from consortial metagenomic sequence (Nelson et al 2015; Romine et al 2017)
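The two composition measures named in the highlights, %G+C and tetranucleotide content, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `gc_percent` and `tetranucleotide_freqs` are hypothetical, ambiguous bases are simply skipped, and reverse-complement tetramers are counted separately, which may differ from what a given binning package does.

```python
from collections import Counter
from itertools import product
from typing import Dict


def gc_percent(seq: str) -> float:
    """Percent G+C among unambiguous bases (A, C, G, T); hypothetical helper."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def tetranucleotide_freqs(seq: str) -> Dict[str, float]:
    """Relative frequency of each of the 256 tetranucleotides.

    Assumption: overlapping windows are counted on the given strand only,
    and windows containing anything other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    total = sum(counts[k] for k in kmers)
    return {k: (counts[k] / total if total else 0.0) for k in kmers}


if __name__ == "__main__":
    region = "ATGCGCGATATTTTGCGCNNATGC"  # made-up example sequence
    print(round(gc_percent(region), 1))
    freqs = tetranucleotide_freqs(region)
    print(sorted(freqs, key=freqs.get, reverse=True)[:3])
```

Computing the same two measures for a binned region and a not-binned region of one genome gives a pair of profiles that can be compared directly, which is the kind of contrast the study describes.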


Introduction

High-throughput sequencing has revolutionized microbiology by allowing investigation of natural communities in a culture-independent manner (Long et al 2016; White et al 2016b; Zhou et al 2015). Metagenomic binning (referred to as ‘genomes from metagenomes’), the process of segregating either sequence reads or assembled contigs and scaffolds into organism-specific ‘bins’, has benefitted from continuing advances in sequencing technologies, sequence assembly algorithms and segregation methods (Sangwan et al 2016), progressing from early successes assembling genomes from a simple community (Tyson et al 2004) to more recent studies reconstructing many organisms from complex environments (Alneberg et al 2014; Anantharaman et al 2016; Baker et al 2015; Brown et al 2015; Graham et al 2016; Kang et al 2015; Li et al 2015; Li et al 2016; Nobu et al 2015; Wu et al 2016). One shortcoming of this approach is that, while it is generally possible to assemble metagenomic data and segregate it into bins, determining the correctness and completeness of the final product has been impossible in almost all cases because appropriate reference genomes for environmental samples are lacking. Without complete reference genome sequences against which to check the data, it remains unclear whether assembly and binning techniques are accurate, specific, and sensitive.
