An empirical study on the matrix-based protein representations and their combination with sequence-based approaches

Loris Nanni, Alessandra Lumini, Sheryl Brahnam

doi:10.1007/s00726-012-1416-6

Abstract

Many domains have a stake in the development of reliable systems for automatic protein classification. Recent studies have focused in particular on new methods for extracting features from a protein that improve classification on specific problems. These methods have proven very useful in one or two domains, but they have failed to generalize well across several domains (i.e., classification problems). In this paper, we evaluate several feature extraction approaches for representing proteins, with the aim of sequence-based protein classification. The representations evaluated start from: the position-specific scoring matrix (PSSM) of the protein; the amino-acid sequence; and a matrix representation of the protein, of dimension (length of the protein) × 20, obtained by using substitution matrices to represent each amino acid as a vector. A valuable result is that a texture descriptor can be extracted from the PSSM protein representation, and that it improves on the performance of standard descriptors based on the PSSM representation. Experimentally, we develop our systems by comparing several protein descriptors on nine different datasets. Each descriptor is used to train a support vector machine (SVM) or an ensemble of SVMs. Although different stand-alone descriptors work well on some datasets (but not on others), we find that fusing classifiers trained on different descriptors achieves good performance across all the tested datasets. The Matlab code and datasets used in this paper are available at http://www.bias.csr.unibo.it\nanni\PSSM.rar.
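The matrix representation of dimension (length of the protein) × 20 mentioned above can be illustrated with a minimal Python sketch. It is not the authors' Matlab implementation: the function names are hypothetical, and `toy_substitution_matrix` is an assumed placeholder (a fixed match/mismatch score) standing in for a real substitution matrix such as BLOSUM62, which would be loaded from a table in practice.

```python
# The 20 standard amino acids, one-letter codes; this column order is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_substitution_matrix(match=2, mismatch=-1):
    """Hypothetical stand-in for a real substitution matrix (e.g. BLOSUM62).

    Returns a nested dict so that subst[a][b] is the score for replacing
    residue a with residue b.
    """
    return {
        a: {b: (match if a == b else mismatch) for b in AMINO_ACIDS}
        for a in AMINO_ACIDS
    }


def encode_protein(sequence, subst):
    """Encode a protein as a len(sequence) x 20 matrix (list of rows).

    Row i is the substitution-score vector of the i-th residue, i.e. the
    row of the substitution matrix indexed by that residue.
    """
    rows = []
    for residue in sequence.upper():
        if residue not in subst:
            raise ValueError(f"unknown residue: {residue!r}")
        rows.append([subst[residue][b] for b in AMINO_ACIDS])
    return rows


if __name__ == "__main__":
    matrix = encode_protein("MKV", toy_substitution_matrix())
    print(len(matrix), len(matrix[0]))  # 3 20
```

Because the number of rows equals the protein length, a matrix of this kind would still need to be reduced to a fixed-length descriptor before training a classifier such as an SVM; that step is not shown here.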

Highlights

The explosion of protein sequences generated in the postgenomic era has not been followed by an equal increase in the knowledge of protein biological attributes, which are essential for basic research and drug development
Notice that the representation method DM is not included in this table; this is because it is available only in a subset of datasets
Given the results reported above, our proposed ensemble FUS1 should prove useful for practitioners and experts alike since it can form the base for building systems that are optimized for particular problems (e.g., support vector machine (SVM) optimization and physicochemical properties selection)

Summary

Introduction

The explosion of protein sequences generated in the postgenomic era has not been followed by an equal increase in the knowledge of protein biological attributes, which are essential for basic research and drug development. Since manual classification of proteins by means of biological experiments is both time-consuming and costly, much effort has been applied to the problem of automating this process using various machine learning algorithms and computational tools for fast and effective classification of proteins given their sequence information [1]. According to [2], a process designed to predict an attribute of a protein based on its sequence generally involves the following procedures: (1) constructing a benchmark dataset for testing and training machine learning predictors, (2) formulating a protein representation based on a discrete numerical model that is correlated with the attribute to predict, (3) proposing a powerful machine learning approach to perform the prediction, (4). The most widely used sequential model is based on the entire amino-acid sequence of a protein, expressed by the sequence of its residues, with each one belonging to one of the 20 native amino-acid types:

Journal: Amino Acids
Publication Date: Oct 30, 2012
Citations: 40
License type: cc-by
