Model-based clustering with certainty estimation: implication for clade assignment of influenza viruses.

Shunpu Zhang, Guoqing Lu, Kevin Beland, Zhong Li

doi:10.1186/s12859-016-1147-x

Abstract

Background: Clustering is a common technique used by molecular biologists to group homologous sequences and study evolution. There remain issues such as how to cluster molecular sequences accurately and, in particular, how to evaluate the certainty of clustering results.

Results: We presented a model-based clustering method to analyze molecular sequences, described a subset bootstrap scheme to evaluate the certainty of the clusters, and showed an intuitive way to examine clusters using 3D visualization. We applied this approach to influenza viral hemagglutinin (HA) sequences. Nine clusters were estimated for highly pathogenic H5N1 avian influenza, which agrees with previous findings. The certainty that a given sequence can be correctly assigned to a cluster was 1.0 in all cases, and the certainty for a given cluster was also very high (0.92–1.0), with an overall clustering certainty of 0.95. For influenza A H7 viruses, ten HA clusters were estimated, and the vast majority of sequences could be assigned to a cluster with a certainty of more than 0.99. The certainties for clusters, however, varied from 0.40 to 0.98; this variation is likely attributable to the heterogeneity of sequence data in different clusters. In both cases, the certainty values estimated using the subset bootstrap method are all higher than those calculated with the standard bootstrap method, suggesting our bootstrap scheme is applicable for the estimation of clustering certainty.

Conclusions: We formulated a clustering analysis approach with the estimation of certainties and 3D visualization of sequence data. We analysed two sets of influenza A HA sequences, and the results indicate that our approach is applicable to clustering analysis of influenza viral sequences.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1147-x) contains supplementary material, which is available to authorized users.

Highlights

Clustering is a common technique used by molecular biologists to group homologous sequences and study evolution
We propose a subset bootstrap method, where the practitioner first decides the proportion of the sequence being sampled, and bootstrapping is conducted by randomly choosing this proportion of the nucleic acid bases of the DNA sequences as the subset for re-sampling, while keeping the remaining sequence unchanged
For any given cluster Ci, we evaluate its certainty as follows: given a pre-determined bootstrapping proportion p, let b = 1,...,B be the index of the bootstrap sample drawn by subset bootstrap sampling
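
The subset bootstrap described in the highlights can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the sequences are already aligned to equal length, that the same randomly chosen positions are used for every sequence, and that "re-sampling" means drawing replacement columns with replacement from the chosen subset only. The function name subset_bootstrap and its parameters are hypothetical.

```python
import random


def subset_bootstrap(alignment, p, rng):
    """Return one subset-bootstrap replicate of an aligned set of sequences.

    alignment: list of equal-length strings (one per sequence).
    p: proportion of alignment positions to resample, between 0 and 1.
    rng: a random.Random instance, passed in so runs are reproducible.

    Assumption (not confirmed by the source): the selected positions are
    refilled by drawing columns with replacement from the selected subset;
    all other positions are copied unchanged.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be aligned to equal length")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")

    k = round(p * length)                    # number of positions to resample
    subset = rng.sample(range(length), k)    # positions chosen without replacement

    # source[j] is the original column copied into position j of the replicate.
    source = list(range(length))
    for pos in subset:
        source[pos] = rng.choice(subset)     # draw with replacement from the subset

    return ["".join(seq[j] for j in source) for seq in alignment]


if __name__ == "__main__":
    rng = random.Random(42)
    toy = ["ACGTACGTAC", "ACGTTCGTAC", "ACCTACGAAC"]
    for replicate in subset_bootstrap(toy, p=0.3, rng=rng):
        print(replicate)
```

Repeating the call B times yields the bootstrap samples b = 1,...,B; each replicate would then be re-clustered to assess how stable the cluster assignments are. Setting p = 1 makes every position eligible for resampling, while smaller values of p leave most of each sequence untouched.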

Summary

Introduction

Clustering is a common technique in biology that partitions molecular sequence data or gene expression data into groups such that data points are highly similar within a group but differ between groups [1, 2]. Molecular biologists use it to group homologous sequences and study evolution, yet issues remain, in particular how to cluster molecular sequences accurately and how to evaluate the certainty of clustering results. Clustering methods are divided into two categories: non-model-based (distance- or similarity-based) approaches and model-based approaches [3, 4]. Model-based clustering techniques can be traced at least as far back as 1963. A review of model-based clustering can be found in [10].

Journal: BMC Bioinformatics
Publication Date: Jul 21, 2016
Citations: 22
License type: CC BY 4.0

Similar Papers

A complete analysis of HA and NA genes of influenza A viruses.
Weifeng Shi ... Chaodong Zhu
PLoS ONE | VOL. 5
29 Dec 2010

Hemagglutinin Gene Variation Rate of H9N2 Avian Influenza Virus by Vaccine Intervention in China.
Ying Cao ... Di Liu
Viruses | VOL. 14
13 May 2022

Relationship of pre-1918 avian influenza HA and NP sequences to subsequent avian influenza strains.
A H Reid ... T G Fanning
Avian diseases | VOL. 47
01 Sep 2003

Primer development to obtain complete coding sequence of HA and NA genes of influenza A/H3N2 virus.
Agustiningsih Agustiningsih ... Vivi Setiawaty
BMC Research Notes | VOL. 9
30 Aug 2016
