Abstract

The rapid growth in available protein, DNA and other biological sequences has made discovering meaningful patterns in sequences a major task for bioinformatics research. Data mining of protein sequence databases poses special challenges, because several protein databases are non-relational, whereas most data mining and machine learning techniques assume that the input is a relational database. Existing sequence mining algorithms mainly focus on mining subsequences. However, a wide range of applications, such as biological DNA and protein motif mining, need an effective way to identify approximate frequent patterns. Existing approximate frequent pattern mining algorithms have limitations, such as a lack of knowledge for finding the patterns, poor scalability, and difficulty adapting to other applications. In this paper, a Generalized Approximate Pattern Algorithm (GAPA) is proposed to efficiently mine approximate frequent patterns in a protein sequence database. Pearson's correlation coefficient is computed among the items of the protein sequence database to analyze the approximate frequent patterns. The performance of the proposed GAPA is analyzed and tested on a FASTA protein sequence database; the FASTA database files hold the protein translations of Ensembl gene predictions. GAPA is compared with existing methods, namely the Approximate Frequent Itemsets (AFI) tree and Approximate Closed Frequent Itemsets (ACFIM), in terms of support, accuracy, memory usage and time consumption. The experimental results show that GAPA is scalable and outperforms the existing algorithms.
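As a rough illustration of two ideas named in the abstract, approximate pattern support and Pearson's correlation coefficient, the following minimal Python sketch may help. It is not the GAPA algorithm, whose details are not given here. It assumes that "approximate" means a match within a fixed number of mismatches (Hamming distance) and that the correlation is taken over per-sequence presence vectors; the function names, the toy sequences and the `max_mismatches` parameter are all hypothetical.

```python
from math import sqrt

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def presence_vector(pattern, sequences, max_mismatches=1):
    """1 if a sequence has a window within max_mismatches of pattern, else 0.

    Assumption: mismatch-tolerant matching stands in for 'approximate'."""
    k = len(pattern)
    return [
        int(any(hamming(pattern, seq[i:i + k]) <= max_mismatches
                for i in range(len(seq) - k + 1)))
        for seq in sequences
    ]

def approx_support(pattern, sequences, max_mismatches=1):
    """Fraction of sequences that approximately contain the pattern."""
    if not sequences:
        return 0.0
    return sum(presence_vector(pattern, sequences, max_mismatches)) / len(sequences)

def pearson(xs, ys):
    """Pearson's correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

# Toy protein fragments (hypothetical, not taken from any real database).
sequences = ["MKVLA", "MKILA", "GGGGG", "AMKVL"]
support_exact = approx_support("MKVL", sequences, max_mismatches=0)
support_approx = approx_support("MKVL", sequences, max_mismatches=1)
```

In this toy setting, allowing one mismatch raises the support of `MKVL` because `MKILA` now counts as a match. Calling `pearson` on the presence vectors of two patterns gives one possible measure of how strongly they co-occur across sequences.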

