Clustering biological sequences with dynamic sequence similarity threshold

Jimmy Ka Ho Chiu, Rick Twee-Hee Ong

doi:10.1186/s12859-022-04643-9

Abstract

Background

Biological sequence clustering is a complicated data clustering problem owing to the high computational cost of calculating pairwise sequence distances through sequence alignment, as well as the difficulty of determining parameters that yield robust clusters. While current approaches succeed in reducing the number of sequence alignments performed, the generated clusters are based on a single sequence identity threshold applied to every cluster. A poor choice of this identity threshold thus leads to low-quality clusters. There is, however, little support for users in selecting thresholds that are well matched to the input sequences.

Results

We present a novel sequence clustering approach called ALFATClust that exploits rapid pairwise alignment-free sequence distance calculations and graph-based community detection for cluster generation. Instead of applying a single threshold to every generated cluster, ALFATClust dynamically determines the cut-off threshold for each individual cluster by considering both cluster separation and intra-cluster sequence similarity. Benchmarking analysis shows that ALFATClust generally outperforms existing approaches by simultaneously maintaining cluster robustness and substantial cluster separation for the benchmark datasets. The software also provides an evaluation report for verifying the quality of the non-singleton clusters obtained.

Conclusions

ALFATClust is able to generate sequence clusters with high intra-cluster sequence similarity and substantial separation between clusters without requiring users to decide precise similarity cut-off thresholds.
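To make the single-threshold problem described in the abstract concrete, the following is a minimal sketch and not the ALFATClust implementation. It assumes k-mer Jaccard similarity as a stand-in alignment-free measure and plain connected components as a stand-in for community detection; the function names `kmer_set`, `jaccard_similarity`, and `cluster_by_threshold` and the toy sequences are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(a, b, k=3):
    """Alignment-free similarity: shared k-mers over all distinct k-mers."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def cluster_by_threshold(seqs, threshold, k=3):
    """Link pairs with similarity >= threshold; return connected components.

    This applies ONE global threshold to every cluster, which is the
    behaviour the abstract identifies as a limitation.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if jaccard_similarity(seqs[i], seqs[j], k) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy input (hypothetical): two pairs of near-identical sequences.
seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTGGCCTTGG", "TTGGCCTTGA"]
print(cluster_by_threshold(seqs, 0.75))  # both pairs are grouped
print(cluster_by_threshold(seqs, 0.85))  # the first pair is split apart
```

In this toy input the first pair has similarity 0.8 and the second about 0.857, so a global cut-off of 0.85 keeps one pair together and breaks the other into singletons. A per-cluster threshold, as the abstract describes, is meant to avoid forcing that trade-off on the user.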

Journal: BMC Bioinformatics	Publication Date: Mar 30, 2022
Citations: 7	License type: open-access

Similar Papers

M-pick, a modularity-based method for OTU picking of 16S rRNA sequences
Xiaoyu Wang ... Jin Yao
BMC Bioinformatics | VOL. 14
07 Feb 2013

TreeCluster: Clustering biological sequences using phylogenetic trees
Niema Moshiri ... Siavash Mirarab
22 Aug 2019

TreeCluster: Clustering biological sequences using phylogenetic trees.
Metin Balaban ... Xingfan Jia
PLOS ONE | VOL. 14
22 Aug 2019

A Model of Opinion Dynamics for Community Detection in Graphs
Irinel-Constantin Morărescu ... Antoine Girard
IFAC Proceedings Volumes | VOL. 43
01 Jan 2009
