Abstract

Protein remote homology detection is one of the most important problems in bioinformatics. Discriminative methods such as support vector machines (SVM) have shown superior performance. However, the performance of SVM-based methods depends on the vector representations of the protein sequences. Prior work has demonstrated that sequence-order effects are relevant for discrimination, but little work has explored how to incorporate sequence-order information together with amino acid physicochemical properties into the prediction. To incorporate sequence-order effects into protein remote homology detection, the physicochemical distance transformation (PDT) method is proposed. Each protein sequence is converted into a series of numbers by using the physicochemical property scores in the amino acid index (AAIndex), and this numerical series is then converted into a fixed-length vector by PDT. With this approach, the sequence-order information is included in the feature vector at little computational cost. Finally, the feature vectors are input into a support vector machine classifier to detect remote protein homologies. Our experiments on a well-known benchmark show that the proposed method, SVM-PDT, achieves performance superior or comparable to that of current state-of-the-art methods, and its computational cost is considerably lower than that of other methods. When the evolutionary information extracted from frequency profiles is combined with the PDT method, the profile-based PDT approach improves the performance by 3.4% and 11.4% in terms of ROC score and ROC50 score, respectively. The proposed PDT efficiently captures the local sequence-order information of the protein, and the physicochemical properties extracted from the amino acid index are incorporated into the prediction. The physicochemical distance transformation provides a general framework that would be a valuable tool for protein-level studies.
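The conversion described in the abstract can be illustrated with a minimal sketch. It assumes, as a reading of the abstract rather than the authors' exact formulation, that the transformation averages the squared difference in a property score between residues separated by a fixed lag, giving one feature per (property, lag) pair. The function name `pdt_features`, the parameter `max_lag`, and the choice of Kyte-Doolittle hydropathy as the example property are illustrative assumptions; any normalization of the property scores is omitted.

```python
from typing import Dict, List

# Kyte-Doolittle hydropathy values, used here only as an example property.
# The source draws its properties from AAIndex; this choice is an assumption.
HYDROPATHY: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def pdt_features(
    sequence: str,
    properties: List[Dict[str, float]],
    max_lag: int,
) -> List[float]:
    """Map a sequence to len(properties) * max_lag features.

    Each feature is the mean squared difference between the property
    scores of residues that are `lag` positions apart (assumed form).
    """
    if max_lag < 1:
        raise ValueError("max_lag must be at least 1")
    if len(sequence) <= max_lag:
        raise ValueError("sequence must be longer than max_lag")

    features: List[float] = []
    for scores in properties:
        # Step 1: turn the sequence into a series of numbers.
        values = [scores[residue] for residue in sequence]
        # Step 2: summarize each lag as one number, so the output
        # length no longer depends on the sequence length.
        for lag in range(1, max_lag + 1):
            squared = [
                (values[j] - values[j + lag]) ** 2
                for j in range(len(values) - lag)
            ]
            features.append(sum(squared) / len(squared))
    return features


if __name__ == "__main__":
    short = pdt_features("MKVLAAGIVGL", [HYDROPATHY], max_lag=3)
    longer = pdt_features("MKVLAAGIVGLSTRRE" * 4, [HYDROPATHY], max_lag=3)
    # Both vectors have the same length despite different sequence lengths.
    print(len(short), len(longer))
```

Because every sequence maps to a vector of the same length, the output can be passed directly to a standard SVM implementation. The cost is linear in the sequence length for each (property, lag) pair, which is consistent with the low computational cost the abstract reports.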

Highlights

  • A vast number of protein sequences have been obtained with the development of large-scale sequencing techniques, and these need to be classified into structural and functional classes by means of homologies

  • Comparative results of the methods based on sequence composition information: to compare the proposed sequence-based physicochemical distance transformation (PDT) vectorization approach with other relevant protein remote homology detection methods, the proposed method, SVM-PDT, was evaluated on the widely used SCOP 1.53 dataset to give an unbiased comparison with prior methods that are based on sequence composition information

  • As the figure shows, the performance increases dramatically when b is less than 4 and then stabilizes, indicating longer distances between two amino acids along the protein sequences are more important for the discrimination


Summary

Introduction

A vast number of protein sequences have been obtained with the development of large-scale sequencing techniques, and these need to be classified into structural and functional classes by means of homologies. Other kernels are built using sequence features, such as motifs [11,12,13], mismatch [14], SVM-I-sites [15], SVM-n-peptide [16], N-gram [17], Patterns [18], SVM-BALSA [19] and so on. The advantage of these methods is that they do not need a computationally expensive feature generation step, but their Receiver Operating Characteristic (ROC) scores are generally low, ranging from 0.87 to 0.90 on the standard SCOP 1.53 benchmark. Due to the high computational cost of the feature generation stage, applying these profile-based methods to large-scale remote homology detection is often infeasible. Some web servers implementing profile-profile alignment algorithms are available, including COMA [29], PHYRE [30], GenThreader [31], FORTE [32] and webPRC [33].
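The ROC scores quoted above, and the ROC50 variant reported in the abstract, can be computed from a ranked list of classifier outputs. The following is a minimal sketch, assuming the usual definition in which the ROC-n score is the area under the ROC curve up to the first n false positives, normalized to lie between 0 and 1. The function name `roc_n` is illustrative, ties between scores are not handled specially, and the source does not state which implementation the authors used.

```python
from typing import Optional, Sequence


def roc_n(
    scores: Sequence[float],
    labels: Sequence[bool],
    n: Optional[int] = None,
) -> float:
    """Normalized area under the ROC curve up to the first n false positives.

    With n=None all negatives are used, which gives the ordinary ROC score.
    """
    # Rank examples from the highest classifier score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    total_neg = len(ranked) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    limit = total_neg if n is None else min(n, total_neg)
    true_pos_seen = 0
    false_pos_seen = 0
    area = 0
    for _, is_pos in ranked:
        if is_pos:
            true_pos_seen += 1
        else:
            # Each false positive adds the number of positives ranked above it.
            false_pos_seen += 1
            area += true_pos_seen
            if false_pos_seen == limit:
                break
    return area / (limit * total_pos)


if __name__ == "__main__":
    example_scores = [0.9, 0.8, 0.7, 0.6]
    example_labels = [True, False, True, False]
    print(roc_n(example_scores, example_labels))       # full ROC score
    print(roc_n(example_scores, example_labels, n=1))  # ROC-1 score
```

Calling `roc_n(scores, labels, n=50)` gives the ROC50 score under this definition. It emphasizes the top of the ranking, which is why it can change by a different amount than the full ROC score.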

