Abstract

In many cases, biological sequence databases contain redundant sequences that make it difficult to achieve reliable statistical analysis. Removing the redundant sequences to find all the real protein families and their representatives from a large sequences dataset is quite important in bioinformatics. The problem of removing redundant protein sequences can be modeled as finding the maximum independent set from a graph, which is a NP problem in Mathematics. This paper presents a novel program named FastCluster on the basis of mathematical graph theory. The algorithm makes an improvement to Hobohm and Sander’s algorithm to generate non-redundant protein sequence sets. FastCluster uses BLAST to determine the similarity between two sequences in order to get better sequence similarity. The algorithm’s performance is compared with Hobohm and Sander’s algorithm and it shows that Fast- Cluster can produce a reasonable non-redundant pro- tein set and have a similarity cut-off from 0.0 to 1.0. The proposed algorithm shows its superiority in generating a larger maximal non-redundant (independent) protein set which is closer to the real result (the maximum independent set of a graph) that means all the protein families are clustered. This makes Fast- Cluster a valuable tool for removing redundant protein sequences.

Highlights

  • With the explosion of biological sequence data, many biological sequence databases have redundant sequences which can cause problems for data analysis

  • The problem of removing redundant protein sequences can be modeled as finding the maximum independent set from a graph, which is a NP problem in Mathematics

  • This paper presents a novel program named FastCluster on the basis of mathematical graph theory

Read more

Summary

Introduction

With the explosion of biological sequence data, many biological sequence databases have redundant sequences which can cause problems for data analysis. These redundant sequences cannot provide valuable information for analysis but detracts from the statistical significance of interesting hits. Processing these redundant sequences often requires more time and computational resources. Removing redundant sequences is undoubtedly very helpful for performing statistical analysis and accelerating extensive database searching [1]. It is necessary to develop an appropriate algorithm to remove redundant sequences from a biological sequence database

Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.