RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches

Xiaoming Xu, Weiguo Liu, Borui Xu, Lifeng Yan, Yanjie Wei, Beifang Niu, Zekun Yin, Hao Zhang, Bertil Schmidt

doi:10.1186/s13059-023-02961-6

Journal: Genome Biology	Publication Date: May 17, 2023
Citations: 4	License type: open-access

Affiliation: Shandong University, City University of Hong Kong, Shenzhen Research Institute, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Computer Network Information Center, Johannes Gutenberg University Mainz

Abstract

We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. On a 128-core workstation, 113,674 complete bacterial genome sequences from RefSeq (455 GB in FASTA format) can be clustered in under 6 min, and 1,009,738 GenBank assembled bacterial genomes (4.0 TB in FASTA format) in only 34 min. Our results further identify 1269 redundant genomes, with identical nucleotide content, in the RefSeq bacterial genomes database.
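To make "sketch-based distance estimation" concrete, the following is a minimal Python sketch of a bottom-s MinHash estimator. It is an illustration only and not RabbitTClust's implementation: the function names, the BLAKE2b hash, the tiny k-mer length, and the sketch size are all assumptions chosen for readability. The conversion from Jaccard index to a distance uses the Mash distance formula, which the abstract does not confirm as the tool's exact measure. Reverse-complement (canonical) k-mers are ignored for brevity.

```python
import hashlib
import math


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(seq, k=5, sketch_size=64):
    """Bottom-s sketch: the sketch_size smallest distinct k-mer hashes, sorted."""
    hashes = {hash64(km) for km in kmers(seq, k)}
    return sorted(hashes)[:sketch_size]


def jaccard_estimate(sketch_a, sketch_b, sketch_size=64):
    """Estimate the Jaccard index from two bottom-s sketches.

    Take the sketch_size smallest hashes of the merged sketches and count
    how many of them occur in both inputs.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:sketch_size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def mash_distance(jaccard, k):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)), clamped to [0, 1]."""
    if jaccard <= 0.0:
        return 1.0
    d = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return max(0.0, min(1.0, d))


if __name__ == "__main__":
    # Toy sequences (hypothetical data, far shorter than a real genome).
    genome_a = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"
    genome_b = "ACGTACGTTAGCCGATAGGCTTAACGGTTCCA"
    k = 5
    j = jaccard_estimate(minhash_sketch(genome_a, k), minhash_sketch(genome_b, k))
    print("Jaccard estimate:", j, "Mash distance:", mash_distance(j, k))
```

Each genome is reduced to a fixed-size list of hashes, so a pairwise comparison costs time proportional to the sketch size rather than the genome length. That reduction is the kind of dimensionality reduction that makes clustering at this scale feasible.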

Full Text