iAMY‐DC: Identifying Amyloid Proteins by Using Dynamic Correlation Features

Hongliang Zou

doi:10.1002/slct.202204629

Abstract

Recent studies have reported that amyloid proteins are closely related to several common diseases, such as Alzheimer's disease, Parkinson's disease, and type 2 diabetes. It is therefore an urgent task to discriminate amyloid proteins from non‐amyloid proteins. In this work, we developed a new machine learning model to identify amyloid proteins based on sequence information. First, fifty different physicochemical (PC) properties were employed to represent the sequences. Then, a sliding window approach was adopted to capture local correlation information based on Pearson's correlation coefficient, and the Least Absolute Shrinkage and Selection Operator (LASSO) algorithm was used to select the most discriminative features. Because the number of negative samples is larger than the number of positive samples, the synthetic minority oversampling technique (SMOTE) was used to balance the dataset. Experiments were performed with a support vector machine using the jackknife test. Compared with existing predictors, the experimental results showed that the proposed method achieves a significant improvement in distinguishing amyloid from non‐amyloid proteins. The dataset and code used in this study are available at https://figshare.com/articles/online_resource/iAMY‐DC/20268093.
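
The sliding-window correlation step can be illustrated with a minimal sketch. The abstract does not specify how the windows are formed or how the correlations are aggregated, so the following is only one plausible reading and not the authors' implementation: each pair of property profiles is compared window by window with Pearson's correlation coefficient, and the per-window values are averaged into one feature per pair. The function names, the window length, the toy sequence, and the randomly generated property table are all hypothetical placeholders.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def windowed_pearson(x, y, window):
    """Pearson correlation of x and y inside every sliding window."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of equal length")
    if not 2 <= window <= x.size:
        raise ValueError("window must be between 2 and the profile length")
    xw = np.lib.stride_tricks.sliding_window_view(x, window)
    yw = np.lib.stride_tricks.sliding_window_view(y, window)
    # Center each window, then apply the usual Pearson formula row by row.
    xc = xw - xw.mean(axis=1, keepdims=True)
    yc = yw - yw.mean(axis=1, keepdims=True)
    num = (xc * yc).sum(axis=1)
    den = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1))
    # Assumption: windows with zero variance get a correlation of 0.
    r = np.zeros_like(num)
    np.divide(num, den, out=r, where=den > 0)
    return r


def correlation_features(profiles, window):
    """One feature per property pair: the mean windowed correlation."""
    profiles = np.asarray(profiles, dtype=float)
    n_props = profiles.shape[0]
    feats = []
    for i in range(n_props):
        for j in range(i + 1, n_props):
            r = windowed_pearson(profiles[i], profiles[j], window)
            feats.append(r.mean())
    return np.array(feats)


# Placeholder property table (NOT real physicochemical values):
# 3 made-up scales over the 20 standard amino acids.
rng = np.random.default_rng(0)
scales = rng.normal(size=(3, len(AMINO_ACIDS)))

sequence = "MKVLAAGIVGLLLAQ"  # toy sequence, for illustration only
idx = [AMINO_ACIDS.index(a) for a in sequence]
profiles = scales[:, idx]  # shape: (n_properties, sequence_length)

features = correlation_features(profiles, window=5)
print(features.shape)  # (3,)
```

With fifty properties, this pairwise scheme would yield 1225 features per sequence, which is the kind of feature space where a selection step such as LASSO becomes useful.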

Full Text