Abstract

Metagenome assembly from short next-generation sequencing data is challenging because of its large scale and computational complexity. Clustering short reads by species before assembly offers a unique opportunity to assemble genomes downstream in parallel, with optimization tailored to each genome. However, current read clustering methods suffer from either false-negative (under-clustering) or false-positive (over-clustering) problems. Here we extended our previous read clustering software, SpaRC, by exploiting statistics derived from multiple samples in a dataset to reduce the under-clustering problem. Using synthetic and real-world datasets, we demonstrated that this method has the potential to cluster almost all of the short reads from genomes with sufficient sequencing coverage. The improved read clustering in turn leads to improved downstream genome assembly quality.

Highlights

  • Metagenome sequencing holds the key to comprehensively understanding the structure, dynamics and interactions of underlying microbial communities and their implications for health and the environment (Chiu & Miller, 2019; Tringe & Rubin, 2005; Thomas, Gilbert & Meyer, 2012)

  • We previously reported an Apache Spark™-based read clustering method, SpaRC, that showed great potential for achieving good scalability and clustering performance (Shi et al, 2018)

  • Global clustering greatly improves short read clustering performance. To test whether multiple samples derived from the same microbial community could be leveraged to improve short read clustering performance, we designed a control dataset by taking 10% of the reads from 50 samples from the CAMI2 synthetic metagenome dataset (Materials and Methods)

Introduction

Metagenome sequencing holds the key to comprehensively understanding the structure, dynamics and interactions of underlying microbial communities and their implications for health and the environment (Chiu & Miller, 2019; Tringe & Rubin, 2005; Thomas, Gilbert & Meyer, 2012). Except for a few cases (Brown et al, 2017), the majority of metagenome sequencing projects have relied on cost-effective, short-read sequencing technologies. These projects routinely produce 100–1,000 gigabases (Gb) of data or more (Howe et al, 2014; Shi et al, 2014).
