Abstract
Protein fold recognition plays an important role in computational protein analysis because it can help infer the function of proteins whose structure is unknown. This paper proposes a Classified Sequential Pattern mining technique for Protein Fold Recognition (CSPF). CSPF consists of two main phases: a sequential pattern mining phase and a fold recognition phase. The sequential pattern mining phase serves as the training phase; for it, the Mix & Test algorithm is developed based on Grammatical Inference. Mix & Test minimizes I/O cost by scanning the database only once, discovers subsequence combinations directly from sequences held in memory without searching the whole sequence file, requires no database projection, handles gaps, and works with variable-length sequences without having to align them. In addition, a parallelized version of Mix & Test is applied to speed up the algorithm. In the fold recognition phase, the folds of unknown proteins are predicted via a proposed testing function. Performance is evaluated on 36 SCOP protein folds, with an accuracy of 75.84% on the training data and 59.7% on the testing data.
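To make the phrase "handles gaps, and works with variable-length sequences without having to align them" concrete, the following is a minimal sketch of gap-constrained subsequence matching. It is not the Mix & Test algorithm, whose details are not given here; the function names `occurs_with_gaps` and `support` and the `max_gap` parameter are assumptions introduced purely for illustration.

```python
from typing import Iterable


def occurs_with_gaps(pattern: str, sequence: str, max_gap: int) -> bool:
    """Return True if `pattern` occurs in `sequence` as a subsequence in which
    at most `max_gap` residues are skipped between consecutive matched symbols.

    Illustrative only: `max_gap` is a hypothetical parameter, not one taken
    from the paper.
    """
    if not pattern:
        return True
    # Positions where the first pattern symbol can be matched.
    positions = {i for i, c in enumerate(sequence) if c == pattern[0]}
    for symbol in pattern[1:]:
        next_positions = set()
        for i in positions:
            # The next symbol may sit 1 .. max_gap + 1 positions after i.
            stop = min(i + max_gap + 2, len(sequence))
            for j in range(i + 1, stop):
                if sequence[j] == symbol:
                    next_positions.add(j)
        positions = next_positions
        if not positions:
            return False
    return True


def support(pattern: str, sequences: Iterable[str], max_gap: int) -> int:
    """Count how many sequences contain `pattern` under the gap constraint.

    Sequences may have different lengths; no alignment is performed.
    """
    return sum(occurs_with_gaps(pattern, s, max_gap) for s in sequences)
```

For example, with the toy sequence `"AGCGGD"`, the pattern `"ACD"` matches when up to two residues may be skipped between matched symbols, but not when only one may be skipped.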
Highlights
Protein fold recognition is an important step towards understanding protein three-dimensional structures and their biological functions
We introduce a Classified Sequential Pattern mining technique for Protein Fold Recognition (CSPF)
We propose the CSPF technique for protein fold recognition
Summary
Protein fold recognition is an important step towards understanding protein three-dimensional structures and their biological functions. Sequential pattern mining algorithms have been proposed to predict protein folds, among them the SPADE-based algorithm SPAM (Sequential PAttern Mining) [39]. Grammatical Inference (GI) is used as the backbone of the sequential pattern mining algorithm, which has run faster and achieved higher accuracy than other sequential pattern mining algorithms for protein fold recognition. We introduce a Classified Sequential Pattern mining technique for Protein Fold Recognition (CSPF). CSPF consists of two main phases: 1) sequential pattern mining and 2) fold recognition. It handles gap constraints, uses data parallelization, and performs incremental updating.
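The two-phase structure (mine patterns per fold, then recognize the fold of an unseen sequence) can be sketched with a deliberately simplified stand-in. The code below mines contiguous k-mers rather than gapped patterns, and scores a query by counting matching patterns; neither is the paper's Mix & Test algorithm or its testing function. The names `mine_patterns`, `train`, `predict_fold`, the parameters `k` and `min_support`, and the fold labels and sequences are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def mine_patterns(sequences: List[str], k: int = 3, min_support: int = 2) -> Set[str]:
    """Phase 1 (simplified): keep contiguous k-mers that appear in at least
    `min_support` sequences. Each sequence is scanned exactly once."""
    counts: Counter = Counter()
    for seq in sequences:
        # A set ensures each k-mer is counted once per sequence.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return {p for p, c in counts.items() if c >= min_support}


def train(folds: Dict[str, List[str]], k: int = 3, min_support: int = 2) -> Dict[str, Set[str]]:
    """Mine one pattern set per fold label."""
    return {fold: mine_patterns(seqs, k, min_support) for fold, seqs in folds.items()}


def predict_fold(model: Dict[str, Set[str]], sequence: str) -> str:
    """Phase 2 (simplified): pick the fold whose patterns occur most often
    in `sequence`. Ties go to the fold listed first in `model`."""
    scores = {fold: sum(p in sequence for p in pats) for fold, pats in model.items()}
    return max(scores, key=scores.get)


# Toy, made-up training data: two hypothetical fold labels.
toy_folds = {
    "fold_A": ["MKVLAAG", "TKVLAAS"],
    "fold_B": ["GSHWWPE", "QSHWWPD"],
}
toy_model = train(toy_folds)
```

With this toy model, `predict_fold(toy_model, "AKVLAAT")` selects `fold_A`, because the query shares the k-mers `KVL`, `VLA`, and `LAA` with that fold's training sequences and none with `fold_B`.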
More From: International Journal of Advanced Computer Science and Applications