Linear normalised hash function for clustering gene sequences and identifying reference sequences from multiple sequence alignments

Manal Helal, Fanrong Kong, John Potter, Fei Zhou, Vitali Sintchenko, Sharon Ca Chen, Dominic E Dwyer

doi:10.1186/2042-5783-2-2

Abstract

Background: Comparative genomics has put additional demands on the assessment of similarity between sequences and on their clustering as a means of classification. However, defining the optimal number of clusters, cluster density and boundaries for sets of potentially related sequences of genes with variable degrees of polymorphism remains a significant challenge. The aim of this study was to develop a method that would identify the cluster centroids and the optimal number of clusters for a given sensitivity level and could work equally well for different sequence datasets.

Results: A novel method that combines the linear mapping hash function and multiple sequence alignment (MSA) was developed. This method takes advantage of the fact that sequences in the MSA output are already sorted by similarity, and identifies the optimal number of clusters, the cluster cut-offs, and the cluster centroids that can represent reference gene vouchers for the different species. The linear mapping hash function maps a distance matrix that is already ordered by similarity to indices, revealing gaps in the values around which the optimal cut-offs between clusters can be identified. The method was evaluated using sets of closely related sequences (16S rRNA gene sequences of Nocardia species) and highly variable sequences (VP1 genomic region of Enterovirus 71), and it outperformed existing unsupervised machine learning clustering methods and dimensionality reduction methods. This method does not require prior knowledge of the number of clusters or the distance between clusters, handles clusters of different sizes and shapes, and scales linearly with the dataset.

Conclusions: The combination of MSA with the linear mapping hash function is a computationally efficient way of clustering gene sequences and can be a valuable tool for the assessment of similarity, clustering of different microbial genomes, identifying reference sequences, and for the study of the evolution of bacteria and viruses.
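The gap-finding idea in the Results can be illustrated with a minimal sketch. It is not the authors' implementation: the exact hash formula is not given on this page, so the min-max normalisation, the bucket count `n_buckets`, and the function names `linear_hash_indices` and `gap_cutoffs` are all assumptions made for illustration. Distances are mapped linearly to integer bucket indices, and runs of empty buckets are reported as candidate cut-offs.

```python
def linear_hash_indices(distances, n_buckets):
    """Map each distance to a bucket index in [0, n_buckets - 1].

    Assumed form: min-max normalisation followed by truncation.
    """
    lo, hi = min(distances), max(distances)
    if hi == lo:
        return [0] * len(distances)
    return [int((d - lo) / (hi - lo) * (n_buckets - 1)) for d in distances]


def gap_cutoffs(distances, n_buckets):
    """Return the midpoint of every run of empty buckets as a candidate cut-off."""
    lo, hi = min(distances), max(distances)
    if hi == lo:
        return []
    occupied = set(linear_hash_indices(distances, n_buckets))
    width = (hi - lo) / (n_buckets - 1)  # value range covered by one bucket
    cutoffs = []
    run_start = None
    for b in range(n_buckets):
        if b not in occupied:
            if run_start is None:
                run_start = b  # first empty bucket of a new gap
        elif run_start is not None:
            # Empty buckets run_start .. b-1 cover [lo + run_start*width, lo + b*width)
            cutoffs.append(lo + (run_start + b) / 2 * width)
            run_start = None
    return cutoffs


# Hypothetical pairwise distances: two tight groups and one outlying value.
example = [0.01, 0.02, 0.03, 0.30, 0.32, 0.90]
indices = linear_hash_indices(example, n_buckets=10)
cutoffs = gap_cutoffs(example, n_buckets=10)
```

In this toy input the two candidate cut-offs fall between 0.03 and 0.30 and between 0.32 and 0.90. The bucket count plays the role of a sensitivity setting here: more buckets expose smaller gaps.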

Highlights

Comparative genomics has put additional demands on the assessment of similarity between sequences and on their clustering as a means of classification.
The identification of the cluster centroid or the most representative [voucher or barcode] sequence has become an important objective in population biology and taxonomy [3,4,5].
Clustering of gene sequences: in the multiple sequence alignment (MSA) output, the sequences are sorted according to the tree-building algorithm used, so that closely related sequences appear consecutively before another family branch begins.
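Because related sequences are adjacent in the alignment order, clusters can be read off as consecutive runs. The following is a minimal sketch under stated assumptions, not the published procedure: the cut-off value, the toy distance matrix, and the helper names `split_ordered` and `medoid` are hypothetical, and the medoid (the member with the smallest total distance to its cluster) stands in for the cluster centroid or reference sequence.

```python
def split_ordered(adjacent_distances, cutoff):
    """Split items 0..n-1, kept in alignment order, into consecutive clusters.

    adjacent_distances[i - 1] is the distance between item i - 1 and item i.
    A new cluster starts wherever that distance exceeds the cutoff.
    """
    clusters = [[0]]
    for i, d in enumerate(adjacent_distances, start=1):
        if d > cutoff:
            clusters.append([i])
        else:
            clusters[-1].append(i)
    return clusters


def medoid(cluster, dist):
    """Return the member with the smallest summed distance to its cluster."""
    return min(cluster, key=lambda i: sum(dist[i][j] for j in cluster))


# Hypothetical symmetric distance matrix for five sequences in MSA order.
dist = [
    [0.00, 0.01, 0.02, 0.30, 0.31],
    [0.01, 0.00, 0.01, 0.29, 0.30],
    [0.02, 0.01, 0.00, 0.28, 0.29],
    [0.30, 0.29, 0.28, 0.00, 0.02],
    [0.31, 0.30, 0.29, 0.02, 0.00],
]
adjacent = [dist[i][i + 1] for i in range(len(dist) - 1)]
clusters = split_ordered(adjacent, cutoff=0.1)  # cutoff is an assumed value
references = [medoid(c, dist) for c in clusters]
```

Only the distances between neighbours in the ordering are scanned to form the clusters, which is one way a single pass over the dataset can suffice once the alignment order is available.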

Summary

Introduction

Comparative genomics has put additional demands on the assessment of similarity between sequences and on their clustering as a means of classification. The aim of this study was to develop a method that would identify the cluster centroids and the optimal number of clusters for a given sensitivity level and would work well across different sequence datasets. Some linkage algorithms are geometric-based and aim at one centroid (e.g., AGglomerative NESting or AGNES) [12], while others (e.g., SLINK) rely on connectivity graph methods producing clusters of proper convex shapes [13]. Defining the optimal number of clusters, cluster density and cluster boundaries for collections of sequences with variable degrees of polymorphism remains a significant challenge [5,17].

Journal: Microbial Informatics and Experimentation
Publication Date: Jan 26, 2012
Citations: 25
License type: cc-by

Similar Papers

Development and validation of consensus clustering-based framework for brain segmentation using resting fMRI.
Srikanth Ryali ... Weidong Cai
Journal of neuroscience methods | VOL. 240
29 Nov 2014

MARS: improving multiple circular sequence alignment using refined sequences
Lorraine A K Ayad ... Solon P Pissis
BMC Genomics | VOL. 18
14 Jan 2017

Dynamic changes of large-scale resting-state functional networks in major depressive disorder
Jiang Zhang ... Jiaojian Wang
Progress in Neuropsychopharmacology & Biological Psychiatry | VOL. 111
29 May 2021

Improving the Dynamic Clustering of Hyperspectral Data Based on the Integration of Swarm Optimization and Decision Analysis
Amin Alizadeh Naeini ... Mohammad Saadatseresht
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing | VOL. 7
01 Jun 2014
