FastCluster: a graph theory based algorithm for removing redundant sequences

Peng-Fei Liu, Zhen-Bing Zeng, Sheng-Yu Ni, Chang-Hong Lu, Liu-Huan Dong, Wen-Cong Lu, Jin-Long Shu, Zi-Liang Qian, Yu-Dong Cai

doi:10.4236/jbise.2009.28090

Open Access

https://doi.org/10.4236/jbise.2009.28090

Abstract

In many cases, biological sequence databases contain redundant sequences that make reliable statistical analysis difficult. Removing redundant sequences to identify the true protein families and their representatives in a large sequence dataset is an important task in bioinformatics. The problem of removing redundant protein sequences can be modeled as finding the maximum independent set of a graph, which is an NP-hard problem. This paper presents a novel program named FastCluster, based on graph theory. The algorithm improves on Hobohm and Sander's algorithm for generating non-redundant protein sequence sets. FastCluster uses BLAST to determine the similarity between two sequences in order to obtain a better similarity measure. Its performance is compared with that of Hobohm and Sander's algorithm, and the comparison shows that FastCluster can produce a reasonable non-redundant protein set and supports similarity cut-offs from 0.0 to 1.0. The proposed algorithm generates a larger maximal non-redundant (independent) protein set, which is closer to the true result (the maximum independent set of the graph), in which all protein families are clustered. This makes FastCluster a valuable tool for removing redundant protein sequences.
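
The graph model described in the abstract can be illustrated with a minimal Python sketch. It is not FastCluster's actual procedure, nor a faithful reproduction of Hobohm and Sander's algorithm; it is a generic greedy heuristic (repeatedly keep the sequence with the fewest redundant neighbours) chosen only to show the idea. The function names, the toy similarity scores, and the 0.5 cut-off are all assumptions made for illustration; in practice the scores would come from a tool such as BLAST.

```python
from itertools import combinations


def build_redundancy_graph(ids, similarity, cutoff):
    """Link two sequences when their similarity is at or above the cut-off."""
    graph = {i: set() for i in ids}
    for a, b in combinations(ids, 2):
        if similarity(a, b) >= cutoff:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_independent_set(graph):
    """Greedy heuristic: keep the vertex with the fewest remaining
    neighbours, discard those neighbours, and repeat until nothing is left.
    The result is a maximal independent set, not necessarily a maximum one."""
    remaining = {v: set(nbrs) for v, nbrs in graph.items()}
    chosen = []
    while remaining:
        # Ties on degree are broken by name so the result is deterministic.
        v = min(remaining, key=lambda x: (len(remaining[x]), x))
        chosen.append(v)
        removed = remaining[v] | {v}
        for u in removed:
            remaining.pop(u, None)
        for nbrs in remaining.values():
            nbrs -= removed
    return sorted(chosen)


# Hypothetical pairwise similarity scores in [0.0, 1.0] (illustrative only).
toy_scores = {
    frozenset(("A", "B")): 0.9,
    frozenset(("B", "C")): 0.8,
    frozenset(("C", "D")): 0.3,
}


def toy_similarity(a, b):
    """Look up a toy score; unlisted pairs are treated as unrelated."""
    return toy_scores.get(frozenset((a, b)), 0.0)


graph = build_redundancy_graph(["A", "B", "C", "D"], toy_similarity, cutoff=0.5)
representatives = greedy_independent_set(graph)
print(representatives)
```

With this toy input, sequences A and B, and B and C, are linked as redundant, so the sketch drops B and keeps the rest. Raising or lowering `cutoff` changes which pairs count as redundant and therefore how many representatives survive.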

Highlights

With the explosion of biological sequence data, many biological sequence databases contain redundant sequences, which can cause problems for data analysis.
The problem of removing redundant protein sequences can be modeled as finding the maximum independent set of a graph, which is an NP-hard problem.
This paper presents a novel program named FastCluster, based on graph theory.
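
The difference between a maximal independent set (one that cannot be extended) and the maximum independent set (the largest possible) is what makes the problem hard. The following Python sketch, which is not taken from the paper, contrasts the two on a tiny hypothetical star-shaped graph. The function names and the graph are assumptions for illustration, and the exhaustive search is exponential in the number of vertices, so it is only practical for very small inputs.

```python
from itertools import combinations


def is_independent(graph, subset):
    """True when no two vertices in the subset share an edge."""
    return all(b not in graph[a] for a, b in combinations(subset, 2))


def maximum_independent_set(graph):
    """Exhaustive search from the largest subset size downwards.
    Exponential time: usable only on tiny graphs."""
    vertices = sorted(graph)
    for size in range(len(vertices), 0, -1):
        for subset in combinations(vertices, size):
            if is_independent(graph, subset):
                return list(subset)
    return []


def first_come_maximal_set(graph, order):
    """Scan the vertices in the given order and keep each one that has
    no edge to a vertex already kept. The result is maximal, but its
    size depends on the order."""
    kept = []
    for v in order:
        if all(v not in graph[u] for u in kept):
            kept.append(v)
    return kept


# Hypothetical graph: one "hub" sequence similar to three others that
# are not similar to each other.
star = {
    "hub": {"x", "y", "z"},
    "x": {"hub"},
    "y": {"hub"},
    "z": {"hub"},
}

maximal = first_come_maximal_set(star, ["hub", "x", "y", "z"])
maximum = maximum_independent_set(star)
print(len(maximal), len(maximum))
```

Keeping the hub first excludes every other sequence, so the order-dependent scan ends with a single representative, while the exhaustive search finds the three mutually non-redundant sequences. A heuristic that yields larger maximal sets therefore gets closer to the maximum without paying the exponential cost.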

Summary

Introduction

With the explosion of biological sequence data, many biological sequence databases contain redundant sequences, which can cause problems for data analysis. These redundant sequences provide no valuable information for analysis and detract from the statistical significance of interesting hits. Processing them also requires more time and computational resources. Removing redundant sequences is therefore very helpful for performing statistical analysis and for accelerating extensive database searches [1]. It is necessary to develop an appropriate algorithm to remove redundant sequences from a biological sequence database.

Journal: Journal of Biomedical Science and Engineering
Publication Date: Jan 1, 2009
Citations: 5
License type: CC BY 4.0

Similar Papers

A Graph Theoretic Algorithm for Removing Redundant Protein Sequences
Pengfei Liu ... Kaiyan Feng
01 Jun 2009

RSDB: representative protein sequence databases have high information content
Jong Park ... Liisa Holm
Bioinformatics | VOL. 16
01 May 2000

Learning to Read and Write in the Language of Proteins
Helen T Hobbs ... Chang C Liu
GEN Biotechnology | VOL. 2
01 Apr 2023

Defining a similarity threshold for a functional protein sequence pattern: the signal peptide cleavage site.
Henrik Nielsen ... Jacob Engelbrecht
Proteins: Structure, Function, and Genetics | VOL. 24
01 Feb 1996
