Abstract

Remote homology detection among proteins using only unlabelled sequences is a central problem in comparative genomics. Cluster kernel methods based on neighborhoods and profiles, together with Markov clustering algorithms, are currently the most popular methods for protein family recognition. Because these methods deviate from random walks through inflation or depend on a hard threshold in the similarity measure, they need enhancement for homology detection among multi-domain proteins. We propose combining spectral clustering with neighborhood kernels in Markov similarity to enhance sensitivity in detecting homology independent of “recent” paralogs. The spectral clustering approach with new combined local alignment kernels exploits the unlabelled protein sequences more effectively at a global level, reducing inter-cluster walks. When combined with corrections based on a modified symmetry-based proximity norm that deemphasizes outliers, the proposed technique outperforms the other state-of-the-art cluster kernels among all twelve implemented kernels. Comparison with the state-of-the-art string and mismatch kernels also shows superior performance scores for the proposed kernels. A similar performance improvement is found on an existing large dataset. The proposed spectral clustering framework over combined local alignment kernels with modified symmetry-based correction therefore achieves superior performance, with better biological relevance, for unsupervised remote homolog detection, even for multi-domain and promiscuous-domain proteins from Genolevures database families. Source code is available upon request. Contact: sarkar@labri.fr.
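
The general idea of clustering over a combination of similarity kernels can be illustrated with a small sketch. It is not the authors' implementation: the weighted-sum combination rule, the two-way split, the power-iteration solver, and the names `combine_kernels` and `spectral_bipartition` are all assumptions made for illustration. It assumes symmetric, non-negative similarity matrices given as nested lists.

```python
import math
import random


def combine_kernels(kernels, weights=None):
    """Weighted sum of same-sized similarity matrices (an assumed combination rule)."""
    n = len(kernels[0])
    if weights is None:
        weights = [1.0 / len(kernels)] * len(kernels)
    combined = [[sum(w * k[i][j] for w, k in zip(weights, kernels))
                 for j in range(n)] for i in range(n)]
    # Average with the transpose so the result is exactly symmetric.
    return [[0.5 * (combined[i][j] + combined[j][i]) for j in range(n)]
            for i in range(n)]


def spectral_bipartition(similarity, iterations=500, seed=0):
    """Split items into two groups by the sign of the second eigenvector
    of the normalized similarity N = D^{-1/2} S D^{-1/2}."""
    n = len(similarity)
    degree = [sum(row) for row in similarity]
    if min(degree) <= 0:
        raise ValueError("every item needs positive total similarity")
    root = [math.sqrt(d) for d in degree]
    norm = [[similarity[i][j] / (root[i] * root[j]) for j in range(n)]
            for i in range(n)]
    # For symmetric non-negative S, sqrt(degree) is the eigenvector of N
    # with eigenvalue 1; it is projected out at every step.
    total = math.sqrt(sum(degree))
    top = [r / total for r in root]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        # Apply (N + I) / 2, which maps eigenvalues from [-1, 1] into [0, 1]
        # so that power iteration converges to the second-largest one.
        v = [0.5 * (sum(norm[i][j] * v[j] for j in range(n)) + v[i])
             for i in range(n)]
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]
        length = math.sqrt(sum(a * a for a in v))
        if length == 0.0:
            raise ValueError("no second eigenvector found")
        v = [a / length for a in v]
    return [1 if a > 0 else 0 for a in v]
```

Calling `spectral_bipartition(combine_kernels([k1, k2]))` returns one 0/1 label per sequence; which group receives label 1 is arbitrary. A full pipeline would use several eigenvectors and a k-way clustering step rather than a single sign split.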

Highlights

  • Remote homology detection from available protein sequences is a fundamental problem in comparative genomics

  • Probabilistic profiles of logarithmic E-values generated by local alignment methods such as BLASTP or PSI-BLAST have recently been used for kernel generation, instead of the sequence encoding itself, for protein classification [21]

  • These results show that the PSI-BLAST kernel, combined with the OrthoMCL Neighborhood Mismatch (OMCL NM) and OrthoMCL Mismatch Profile (OMCL MP) kernels after modified symmetry-based redistribution (X, XII), consistently outperforms other combined kernels, with higher ROC50 values
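
The ROC50 score used in the comparison above is commonly defined as the area under the ROC curve up to the first 50 false positives, scaled to the range [0, 1]. A minimal sketch under that definition follows. The function name `roc_n`, the tie handling (ties keep their input order), and the normalization used when fewer than `n` negatives exist are assumptions for illustration, not the authors' evaluation code.

```python
from typing import Sequence


def roc_n(scores: Sequence[float], labels: Sequence[int], n: int = 50) -> float:
    """Area under the ROC curve up to the first n false positives, in [0, 1].

    labels: 1 marks a true homolog; any other value marks a non-homolog.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    total_pos = sum(1 for y in labels if y == 1)
    total_neg = len(labels) - total_pos
    # Assumption: with fewer than n negatives, normalize by the number available.
    max_fp = min(n, total_neg)
    if total_pos == 0 or max_fp == 0:
        raise ValueError("need at least one positive and one negative example")

    # Rank from highest to lowest score; Python's sort is stable.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_pos = false_pos = area = 0
    for _, label in ranked:
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
            # Each false positive contributes the true positives ranked above it.
            area += true_pos
            if false_pos == max_fp:
                break
    return area / (max_fp * total_pos)
```

A ranking that places every homolog above every non-homolog scores 1.0, and one that places the first `n` non-homologs above all homologs scores 0.0.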


Summary

Introduction

Remote homology detection from available protein sequences is a fundamental problem in comparative genomics. Detecting remote homologs with subtle sequence similarity remains challenging. The probabilistic profile (PSSM) method PSI-BLAST [8] exhibits superior performance for remote homology detection. Discriminative kernel methods with SVMs, such as mismatch string kernels [6,9], string alignment kernels [10], and profile-based direct kernels [11], exhibited better homology detection. These methods require extensive annotated proteins for training to yield good performance. Recent approaches for homology detection incorporate an incremental kernel [14], a multi-instance kernel [13], or gapped Markov-feature pairs [15].

