Abstract
The accelerating growth of public microbial genomic data imposes a substantial burden on the research community that uses these resources. Building databases of non-redundant reference sequences from massive microbial genomic data through clustering analysis is essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, in which clustering is accelerated by a novel parallelization strategy and a fast sequence comparison algorithm based on sparse suffix arrays (SSAs). Genome identity measures between two sequences are calculated from their maximal exact matches (MEMs). We demonstrate the high speed and clustering quality of Gclust on four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.
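To make the idea of MEM-based identity and clustering concrete, the following is a minimal Python sketch and not Gclust's implementation. It rests on several assumptions that do not come from the abstract: MEMs are found by a naive quadratic scan rather than a sparse suffix array, identity is taken to be the fraction of one sequence covered by MEMs (the actual Gclust measure may differ), and clustering is a simple serial, longest-first greedy pass. The function names `maximal_exact_matches`, `mem_coverage_identity`, and `greedy_cluster` are hypothetical.

```python
from typing import List, Tuple


def maximal_exact_matches(a: str, b: str, min_len: int) -> List[Tuple[int, int, int]]:
    """Return (start_in_a, start_in_b, length) for each maximal exact match
    of at least min_len characters.

    Naive O(len(a) * len(b)) scan; an assumption standing in for the
    sparse-suffix-array search mentioned in the abstract.
    """
    mems = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Keep only left-maximal matches: skip if extendable to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= min_len:
                mems.append((i, j, k))
    return mems


def mem_coverage_identity(a: str, b: str, min_len: int) -> float:
    """Fraction of positions in `a` covered by at least one MEM with `b`.

    This coverage-style measure is an illustrative assumption, not the
    identity formula used by Gclust.
    """
    if not a:
        return 0.0
    covered = [False] * len(a)
    for start, _, length in maximal_exact_matches(a, b, min_len):
        for pos in range(start, start + length):
            covered[pos] = True
    return sum(covered) / len(a)


def greedy_cluster(seqs: List[str], threshold: float, min_len: int) -> List[List[int]]:
    """Longest-first greedy clustering (assumed scheme).

    Each sequence joins the first cluster whose representative (the first
    index in the cluster) it matches at or above `threshold`; otherwise it
    becomes the representative of a new cluster.
    """
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]), reverse=True)
    clusters: List[List[int]] = []
    for idx in order:
        for cluster in clusters:
            rep = seqs[cluster[0]]
            if mem_coverage_identity(seqs[idx], rep, min_len) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


if __name__ == "__main__":
    toy = ["ACGTACGTTTGA", "ACGTACGTTTG", "GGGCCCAAATTT"]
    print(greedy_cluster(toy, threshold=0.9, min_len=4))
```

On real genomes the quadratic scan would be far too slow, which is the reason a suffix-array index and parallel execution matter in practice; the sketch only shows how a MEM list can be turned into an identity score and a clustering decision.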
Highlights
The first complete bacterial genome was published more than 20 years ago [1]
We tested the performance of Gclust using four RefSeq genome datasets
We present an open source program for clustering microbial genomic sequences
Summary
The first complete bacterial genome was published more than 20 years ago [1]. As of the beginning of 2018, the Genomes OnLine Database [...] Most genomic studies have focused on microbial species, especially bacteria. The growth of publicly available bacterial genomes has become substantial, and the amount of such data poses significant challenges for researchers who want to use these resources efficiently. These databases host a large proportion of redundant genomes from the same or closely related species, and this redundancy has to be reduced.