Abstract

Protein remote homology detection is a key problem in bioinformatics. Currently, discriminative methods such as the Support Vector Machine (SVM) achieve the best performance. The most efficient way to improve SVM-based methods is to find a general protein representation that converts proteins of different lengths into fixed-length vectors and captures the properties of the proteins that matter for discrimination. The bottleneck in designing such a representation is that native proteins have different lengths. Motivated by the success of the pseudo amino acid composition (PseAAC) proposed by Chou, we applied this approach to protein remote homology detection. Some new indices derived from the amino acid index (AAIndex) database are incorporated into the PseAAC to improve the generalization ability of the method. Our experiments on a well-known benchmark show that this method achieves performance superior to or comparable with that of current state-of-the-art methods.
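
The fixed-length idea behind PseAAC can be illustrated with a short sketch. This follows the general PseAAC formulation (20 amino acid frequencies plus lambda sequence-order correlation factors), not the specific feature set of this work. It assumes the Kyte-Doolittle hydropathy scale as the only physicochemical index, as a stand-in for the AAIndex-derived indices, which are not listed here. The names `pseaac`, `lam`, and `w` and their default values are hypothetical choices for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale, used here only as a stand-in index
# (an assumption; the paper's AAIndex-derived indices are not given).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(index):
    """Rescale an index to zero mean and unit variance over the 20 amino acids."""
    values = [index[a] for a in AMINO_ACIDS]
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return {a: (index[a] - mean) / std for a in AMINO_ACIDS}


def pseaac(seq, indices, lam=3, w=0.05):
    """Return a (20 + lam)-dimensional vector for a sequence of any length > lam.

    Assumes seq contains only the 20 standard amino acid letters.
    """
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    norm = [standardize(ix) for ix in indices]

    # Conventional amino acid composition (order-independent part).
    freqs = [seq.count(a) / len(seq) for a in AMINO_ACIDS]

    # Sequence-order correlation factors theta_1 .. theta_lam: the mean squared
    # difference in index values between residues j positions apart.
    thetas = []
    for j in range(1, lam + 1):
        total = 0.0
        for i in range(len(seq) - j):
            total += sum((ix[seq[i]] - ix[seq[i + j]]) ** 2 for ix in norm) / len(norm)
        thetas.append(total / (len(seq) - j))

    # Joint normalization, with w weighting the sequence-order part.
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


# Two sequences of different lengths map to vectors of the same dimension.
short_vec = pseaac("MKVLAAGIVG", [KYTE_DOOLITTLE])
long_vec = pseaac("MKVLAAGIVGLLLAQPSWAETYDRCNFH", [KYTE_DOOLITTLE])
print(len(short_vec), len(long_vec))
```

Because the output dimension depends only on `lam` and not on the sequence length, vectors of this kind can be passed directly to a classifier such as an SVM.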



Summary

Introduction

Protein remote homology detection, the detection of evolutionary homology between proteins with low sequence similarity, is a challenging problem in bioinformatics that has been studied intensively for a decade. Unlike generative methods, discriminative methods learn a combination of features that can discriminate between protein families. Among these, the top-performing methods use support vector machines (SVM) [4] to build the discriminative framework. The LA kernel [5] measures the similarity between a pair of proteins by taking into account all the optimal local alignment scores with gaps between all possible subsequences. Some top-performing methods employ the evolutionary information extracted from profiles. These methods need an additional alignment step to generate the profiles by searching against a non-redundant database, which leads to higher computational cost. Top-n-grams extract profile-based patterns by considering the most frequent elements in the profiles [8].
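
The profile-based pattern extraction mentioned above can be sketched roughly as follows. This is a simplified illustration of the Top-n-gram idea: the n most frequent amino acids at each profile position are combined into one token, and the tokens are counted. It is not the implementation from [8]. The toy profile, the function name `top_n_gram_vector`, and the alphabetical tie-breaking rule are assumptions made for the example. A real profile would come from the alignment step described above.

```python
from collections import Counter
from itertools import permutations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def top_n_gram_vector(profile, n=1):
    """Count Top-n-gram tokens over a frequency profile.

    profile: one dict per sequence position, mapping amino acid -> frequency
             (amino acids missing from a dict are treated as frequency 0).
    Returns a fixed-length count vector over all ordered n-tuples of
    distinct amino acids.
    """
    vocabulary = ["".join(p) for p in permutations(AMINO_ACIDS, n)]
    counts = Counter()
    for column in profile:
        # Rank amino acids by decreasing frequency; sorted() is stable, so
        # ties fall back to alphabetical order (an arbitrary choice here).
        ranked = sorted(AMINO_ACIDS, key=lambda aa: -column.get(aa, 0.0))
        counts["".join(ranked[:n])] += 1
    return [counts[token] for token in vocabulary]


# Hypothetical three-position profile, for illustration only.
toy_profile = [
    {"A": 0.7, "G": 0.3},
    {"L": 0.6, "I": 0.4},
    {"A": 0.5, "S": 0.5},
]

top1 = top_n_gram_vector(toy_profile, n=1)  # tokens: "A", "L", "A"
top2 = top_n_gram_vector(toy_profile, n=2)  # tokens: "AG", "LI", "AS"
print(len(top1), len(top2))
```

As with PseAAC, the vector length depends only on `n`, so profiles of proteins with different lengths yield vectors of the same dimension.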

