Abstract

SNARE proteins are a group of proteins that drive the biological fusion of two membranes. Identifying them accurately is important, because malfunction of SNARE proteins can lead to many diseases. In this paper, a Pearson-based feature compressing model is proposed to identify SNARE proteins accurately and efficiently. First, the 188D, CKSAAP, CTDD and CTRIAD feature extraction methods are used to extract features from SNARE and non-SNARE proteins. Because the number of features extracted by the four methods is very large, many redundant features are included, so the original feature set needs to be filtered. The Chi-Square, Information Gain and Pearson Correlation Coefficient feature selection methods are used to score each feature in the feature set. The selected features are used to train a random forest classifier, and their performance is evaluated by cross-validation. The experimental results show that the CTDD-based model with the first 70% of features selected by the Pearson feature selection method achieves the best performance among all models tested.
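As an illustration of the Pearson-based selection step described above, the following is a minimal sketch, not the authors' implementation. It assumes that features are ranked by the absolute value of their Pearson correlation with the class label and that the top 70% are kept; the function names (`pearson`, `rank_features`, `keep_top_fraction`) and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column carries no linear signal; score it as 0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(X, y):
    """Return feature indices ordered from most to least correlated with y.

    Assumption: ranking uses |r|, so strongly negative correlations count
    as informative too.
    """
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def keep_top_fraction(X, y, fraction=0.7):
    """Keep the top `fraction` of ranked features (at least one)."""
    order = rank_features(X, y)
    k = max(1, int(fraction * len(order)))
    kept = sorted(order[:k])  # preserve original column order
    reduced = [[row[j] for j in kept] for row in X]
    return kept, reduced


# Toy data (hypothetical): 4 samples, 3 features, binary labels.
X = [
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.9],
    [3.0, 5.0, 0.1],
    [4.0, 5.0, 0.8],
]
y = [0, 0, 1, 1]

kept, X_reduced = keep_top_fraction(X, y, fraction=0.7)
print(kept)
```

In a full pipeline the reduced feature matrix would then be passed to a random forest classifier and scored by cross-validation, as the abstract describes.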

Highlights

  • SNARE proteins are a group of proteins that drive the biological fusion of two membranes

  • Experiments show that the Pearson feature selection method applied to the CTDD feature extraction method achieves the best performance among all models

  • The contributions of this work include: (1) three kinds of feature selection methods are applied to four feature sets extracted from the SNARE proteins by four feature extraction methods

Summary

INTRODUCTION

SNARE proteins are a group of proteins that drive the biological fusion of two membranes. Features are extracted from the SNARE proteins, and machine learning algorithms are trained on those features. Features of this kind bring two problems; for one, they increase the complexity of the training algorithm. We test the performance of several feature compressing methods to identify the SNARE proteins accurately and efficiently. Because the number of features extracted by the four methods is very large, three feature compressing methods (Chi-Square, Information Gain and Pearson) are used to compress the extracted feature set. The Information Gain method ranks the features by calculating how much information each feature brings to the classification system.
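The Information Gain ranking mentioned above can be sketched as follows. This is a minimal illustration, not the paper's code: it uses the standard definition IG(Y; X) = H(Y) - H(Y | X) and assumes the feature values are already discrete (continuous features would first need binning, which is not shown). The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG = H(labels) - H(labels | feature), for a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels within each feature-value group.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_by_information_gain(X, y):
    """Return feature indices ordered from highest to lowest information gain."""
    n_features = len(X[0])
    gains = [information_gain([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: gains[j], reverse=True)


# Toy data (hypothetical): feature 0 is uninformative, feature 1 predicts y exactly.
X = [
    [0, 1],
    [1, 1],
    [0, 0],
    [1, 0],
]
y = [1, 1, 0, 0]

print(rank_by_information_gain(X, y))
```

A feature that splits the samples into groups with pure labels removes all uncertainty about the class and gets the maximum gain, while a feature whose groups have the same label mix as the whole set gets a gain of zero.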

Li: Pearson Based Feature Compressing Model for SNARE Protein Classification
METHODS
FEATURE EXTRACTION METHODS
FEATURE COMPRESSING METHODS
RANDOM FOREST
PERFORMANCE OF DIFFERENT KINDS OF COMPRESSING METHODS FOR THE CKSAAP
PERFORMANCE OF DIFFERENT KINDS OF COMPRESSING METHODS FOR THE CTDD
Findings
CONCLUSION