A KNN Model Based on Manhattan Distance to Identify the SNARE Proteins

Xing Gao, Guilin Li

doi:10.1109/access.2020.3003086

Abstract

SNARE proteins, known as membrane fusion proteins, play a primary role in mediating vesicle fusion. Loss of function of SNARE proteins can lead to a variety of diseases, so a method that accurately identifies SNARE proteins is important and necessary. In this paper, we try different combinations of sampling methods (resampling, SMOTE and no sampling), feature extraction approaches (188D, K-skip-2-gram and CKSAAP) and distance measurements (Chebyshev, Euclidean, Manhattan and Minkowski distance) to find a suitable model for identifying SNARE proteins. Through extensive experiments, we construct a Manhattan distance based KNN model that combines the CKSAAP feature extraction approach with no sampling, which achieves the best identification performance among all combinations. Finally, we compare our KNN based model with a deep learning based model (called SNARE-CNN) on four measures: sensitivity (SN), specificity (SP), accuracy (ACC) and Matthews correlation coefficient (MCC). The experimental results show that our model performs better than SNARE-CNN.

Highlights

SNARE proteins, known as membrane fusion proteins, play a primary role in mediating vesicle fusion [1], [2]
We propose a machine learning model based on the K Nearest Neighbor (KNN) algorithm [32] to accurately identify SNARE proteins
Experimental results show that the KNN model based on Manhattan distance, no sampling and the CKSAAP feature extraction approach performs best among all models

Summary

INTRODUCTION

SNARE proteins, known as membrane fusion proteins, play a primary role in mediating vesicle fusion [1], [2]. We propose a machine learning model based on the KNN algorithm [32] to accurately identify SNARE proteins. We find that, across the feature extraction approaches and distance measurements tested, no sampling always performs better than the resampling and SMOTE methods. We construct a KNN model that combines the Manhattan distance and the CKSAAP feature extraction approach with no sampling, which achieves the best identification performance. The contributions of this work are: (1) extensive experiments test the performance of different feature extraction methods, sampling methods and distance measurements of the KNN algorithm for identifying SNARE proteins; (2) the experimental results show that the KNN model based on Manhattan distance, no sampling and the CKSAAP feature extraction approach performs best among all models.

METHODS

DATASET

FEATURE EXTRACTION METHODS
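
As an illustration of the CKSAAP (composition of k-spaced amino acid pairs) encoding named in the introduction, the following is a minimal sketch rather than the authors' implementation. It assumes the common formulation in which, for each gap k from 0 to a chosen maximum, the frequency of every ordered pair of the 20 standard amino acids separated by k residues is recorded. The function name `cksaap`, the default `k_max=2`, and the choice to skip windows containing non-standard residues are assumptions; this summary does not state the gap values used in the paper.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cksaap(sequence, k_max=2):
    """Return CKSAAP features for a protein sequence.

    For each gap k in 0..k_max, count every ordered residue pair
    (sequence[i], sequence[i + k + 1]) and normalise by the number of
    counted pairs, giving 400 frequencies per gap value.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(pairs, 0)
        total = 0
        for i in range(len(sequence) - k - 1):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:  # assumption: ignore non-standard residues
                counts[pair] += 1
                total += 1
        # Frequencies for this gap; all zeros if the sequence is too short.
        features.extend(counts[p] / total if total else 0.0 for p in pairs)
    return features
```

With `k_max=2` each sequence maps to a vector of 3 × 400 = 1200 values, which can then be passed to a distance based classifier.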

KNN ALGORITHM
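
The four distance measurements compared in this work, and the way a KNN classifier uses them, can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the authors' code: the function names, the default `k=3`, the Minkowski order `p=3`, and the majority-vote tie handling are all hypothetical choices that this summary does not specify.

```python
from collections import Counter


def manhattan(a, b):
    # Sum of absolute coordinate differences (Minkowski with p = 1).
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    # Square root of summed squared differences (Minkowski with p = 2).
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def chebyshev(a, b):
    # Largest absolute coordinate difference (limit of Minkowski as p grows).
    return max(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, p=3):
    # General form; p is an assumed value, not taken from the paper.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_predict(train_X, train_y, query, k=3, distance=manhattan):
    """Label `query` by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda item: distance(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    # On a tied vote, the label seen first among the nearest neighbours wins.
    return votes.most_common(1)[0][0]
```

Passing a different function as `distance` (for example `distance=chebyshev`) is enough to switch measurements, which mirrors the kind of comparison the experiments describe.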

CONCLUSION