Abstract

Background

Classification of certain proteins with specific functions is important for biological research. Encoding approaches for extracting features from protein sequences play an important role in protein classification, and many computational methods (namely classifiers) are applied to protein sequences under various encoding approaches. Commonly, protein sequences carry labels corresponding to different categories of biological function (e.g., bacterial type IV secreted effectors or not), which makes protein prediction a fantasy. For protein prediction, a kernel set of protein sequences whose labels have been certified by biological experiments should exist in advance. However, this has rarely been seen in prevailing research. Therefore, unsupervised learning rather than supervised learning (e.g., classification) should be considered. For protein classification, various classifiers may help to evaluate the effectiveness of different encoding approaches. In addition, variable selection from an encoded feature representing protein sequences is an important issue that also needs to be considered.

Results

Focusing on the latter problem, we propose a new method for variable selection from an encoded feature representing protein sequences. Taking a benchmark dataset of 1947 protein sequences (399 T4SE and 1548 non-T4SE) as a case, experiments are carried out to identify bacterial type IV secreted effectors (T4SE) from protein sequences. Comparable and quantified results are obtained using only certain components of the encoded feature, i.e., the position-specific scoring matrix, which indicates the effectiveness of our method.

Conclusions

Certain variables, rather than the whole encoded feature they belong to, do work for discriminating between different types of proteins. In addition, ensemble classifiers with an automatic assignment of different base classifiers do achieve a better classification result.

Highlights

  • Classification of certain proteins with specific functions is important for biological research

  • Feature extraction from protein sequences plays an important role in protein classification [1,2,3,4] in many areas, such as identification of plant pentatricopeptide repeat coding proteins [5], prediction of bacterial type IV secreted effectors [6,7,8,9], identification of heat shock proteins [10], and prediction of mitochondrial proteins [11]

  • Several problems still remain. One is deciding which encoding approach is more effective; this can be resolved by comparing the classification results obtained with each encoding approach across various computational methods


Summary

Results

We take a benchmark dataset [51] as a case to evaluate the performance of our proposed method.

Results of ensemble classification on testing group

The ensemble classifiers were built using 1000 rounds of resampling and training on the training set in each dimension, with 400 variables added incrementally in descending order of their accumulated scores. As 4b and 5 show, ensemble classification with automatic assignment of the base classifier yields a better ROC curve and AUC value (0.9287). The ensemble classifier with automatic assignment of a base classifier on the 25 selected variables also achieves a high number of true positives (TP = 157) compared with most of the other classification strategies. In any case, tuning parameter values to obtain better classification results is beside the point for variable selection.
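The ranking step described above (accumulate a score per variable over resampling rounds, then add variables in descending order of that score) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the scoring rule (distance of a single-variable AUC from chance), the bootstrap resampling scheme, the toy data, and the names `auc`, `accumulate_scores`, and `nested_subsets` are all hypothetical, since this summary does not specify how the accumulated scores are computed.

```python
import random


def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accumulate_scores(X, y, rounds=100, seed=0):
    """Sum a per-variable score over bootstrap resamples of the rows.

    Assumption: the score is |AUC - 0.5| of each variable taken alone;
    the paper's actual scoring rule is not given in this summary.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    totals = [0.0] * d
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        labels = [y[i] for i in idx]
        if len(set(labels)) < 2:
            continue  # skip a resample that lost one of the classes
        for j in range(d):
            a = auc([X[i][j] for i in idx], labels)
            totals[j] += abs(a - 0.5)
    return totals


def nested_subsets(totals, max_vars):
    """Variable index sets of size 1..max_vars, highest totals first."""
    order = sorted(range(len(totals)), key=lambda j: totals[j], reverse=True)
    return [order[:k] for k in range(1, min(max_vars, len(order)) + 1)]


# Toy data (hypothetical): variable 0 is informative, 1 and 2 are noise.
rng = random.Random(1)
y = [1] * 20 + [0] * 80
X = [[3.0 * yi + rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)] for yi in y]

totals = accumulate_scores(X, y, rounds=50)
subsets = nested_subsets(totals, max_vars=3)
print(subsets[0])  # the single top-ranked variable
```

In a fuller pipeline, each nested subset would be passed to a classifier and evaluated on the held-out testing group, so that the subset size (25 in the results above) is chosen from the resulting performance curve rather than fixed in advance.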
