Aegis: a transformer-based deep learning framework for the accurate identification of anticancer peptides.

Abstract

Anticancer peptides (ACPs) are promising therapeutic agents with selective cytotoxicity toward cancer cells and minimal toxicity toward normal cells. However, the experimental identification and characterization of ACPs are often costly, time-consuming, and inefficient. Computational approaches provide promising alternatives for the rapid and accurate prediction of ACPs. Here, we introduce Aegis, a novel transformer-based deep learning framework designed for precise ACP identification. We systematically evaluated various machine learning and deep learning models via multiple feature extraction methods, including the composition of k-spaced amino acid pairs (CKSAAP), CTD composition (CTDC), CTD transition (CTDT), CTD distribution (CTDD), and pseudo amino acid composition (PAAC) methods. Comprehensive feature importance analyses via analysis of variance (ANOVA), ReliefF, and SHapley Additive exPlanations (SHAP) methods were performed, followed by incremental feature selection (IFS) to determine the optimal subset of discriminative features. Using the 103 optimal features identified via SHAP, Aegis achieves state-of-the-art (SOTA) performance on an independent testing dataset, outperforming existing ACP prediction models. Furthermore, compositional analysis revealed that ACP sequences are significantly enriched in positively charged and hydrophobic residues. Overall, our study demonstrates the exceptional potential of transformer-based deep learning for ACP identification, laying a foundation for future computational screening and the clinical development of novel ACPs.
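Of the descriptors listed in the abstract, CKSAAP is the simplest to illustrate. The following is a minimal sketch, not the Aegis implementation: the function name `cksaap`, the `k_max` parameter, and the dictionary output keyed by `(gap, pair)` are assumptions made for illustration. For each gap k it counts residue pairs separated by k positions and normalizes by the number of such pairs in the sequence.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def cksaap(sequence, k_max=2):
    """Composition of k-spaced amino acid pairs for gaps 0..k_max.

    Hypothetical helper for illustration; returns a dict keyed by (gap, pair).
    """
    seq = sequence.upper()
    features = {}
    for k in range(k_max + 1):
        counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
        # Number of residue pairs separated by exactly k positions.
        windows = len(seq) - k - 1
        for i in range(max(windows, 0)):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:  # pairs with non-standard residues are skipped
                counts[pair] += 1
        for pair, n in counts.items():
            features[(k, pair)] = n / windows if windows > 0 else 0.0
    return features


feats = cksaap("AKAK", k_max=2)
print(len(feats))          # 3 gaps x 400 pairs
print(feats[(0, "AK")])    # frequency of adjacent A-K pairs
```

Each gap contributes 400 values, so `k_max=2` yields a 1200-dimensional vector per peptide; a feature selection step such as the one described in the abstract would then reduce this to a smaller subset.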

Similar Papers
  • Research Article
  • Citations: 22
  • 10.1155/2020/8858489
Succinylation Site Prediction Based on Protein Sequences Using the IFS-LightGBM (BO) Model.
  • Nov 10, 2020
  • Computational and Mathematical Methods in Medicine
  • Lu Zhang + 3 more

Succinylation is an important posttranslational modification of proteins, which plays a key role in protein conformation regulation and cellular function control. Many studies have shown that succinylation of protein lysine residues is closely related to the occurrence of many diseases. To understand the mechanism of succinylation in depth, it is necessary to identify succinylation sites in proteins accurately. In this study, we develop a new model, IFS-LightGBM (BO), which utilizes the incremental feature selection (IFS) method, the LightGBM feature selection method, the Bayesian optimization algorithm, and the LightGBM classifier, to predict succinylation sites in proteins. Specifically, pseudo amino acid composition (PseAAC), position-specific scoring matrix (PSSM), disorder status, and composition of k-spaced amino acid pairs (CKSAAP) are first employed to extract feature information. Then, the LightGBM feature selection method is combined with IFS to select the optimal feature subset for the LightGBM classifier. Finally, to increase prediction accuracy and reduce the computational load, the Bayesian optimization algorithm is used to optimize the parameters of the LightGBM classifier. The results reveal that the IFS-LightGBM (BO)-based prediction model performs better when evaluated by common metrics such as accuracy, recall, precision, Matthews correlation coefficient (MCC), and F-measure.

  • Research Article
  • Citations: 21
  • 10.3389/fcell.2020.621144
ApoPred: Identification of Apolipoproteins and Their Subfamilies With Multifarious Features.
  • Jan 8, 2021
  • Frontiers in cell and developmental biology
  • Ting Liu + 6 more

Apolipoproteins are a group of plasma proteins that are associated with a variety of diseases, such as hyperlipidemia, atherosclerosis, Alzheimer’s disease, and diabetes. To investigate the function of apolipoproteins and to develop effective targets for related diseases, it is necessary to accurately identify and classify apolipoproteins. Although apolipoproteins can be identified accurately through biochemical experiments, such experiments are expensive and time-consuming. This work aims to establish a high-efficiency and high-accuracy prediction model for the recognition of apolipoproteins and their subfamilies. We first constructed a high-quality benchmark dataset including 270 apolipoproteins and 535 non-apolipoproteins. Based on this dataset, pseudo-amino acid composition (PseAAC) and composition of k-spaced amino acid pairs (CKSAAP) were used as input vectors. To improve the prediction accuracy and eliminate redundant information, analysis of variance (ANOVA) was used to rank the features, and incremental feature selection was then used to obtain the best feature subset. A support vector machine (SVM) was used to construct the classification model, which achieved an accuracy of 97.27%, a sensitivity of 96.30%, and a specificity of 97.76% for discriminating apolipoproteins from non-apolipoproteins in 10-fold cross-validation. In addition, the same process was repeated to generate a new model for predicting apolipoprotein subfamilies. The new model achieved an overall accuracy of 95.93% in 10-fold cross-validation. Based on the proposed model, a convenient webserver called ApoPred was established, which can be freely accessed at http://tang-biolab.com/server/ApoPred/service.html. We expect that this work will contribute to apolipoprotein function research and drug development in relevant diseases.
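The ANOVA ranking followed by incremental feature selection described above can be sketched as follows. This is an illustration under stated assumptions, not the ApoPred code: a nearest-centroid classifier with leave-one-out evaluation stands in for the SVM and 10-fold cross-validation of the abstract so that the example needs only the standard library, and the function names and toy data are hypothetical.

```python
def anova_f(column, labels):
    """One-way ANOVA F-statistic of a single feature across class labels."""
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(column), len(groups)
    grand_mean = sum(column) / n
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))


def loo_accuracy(X, y, cols):
    """Leave-one-out accuracy of a nearest-centroid classifier (SVM stand-in)."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            centroids[label] = [sum(r[c] for r in rows) / len(rows) for c in cols]
        point = [X[i][c] for c in cols]
        pred = min(centroids, key=lambda lab: sum(
            (p - q) ** 2 for p, q in zip(point, centroids[lab])))
        correct += pred == y[i]
    return correct / len(X)


def incremental_feature_selection(X, y):
    """Rank features by F-score, then grow the subset one feature at a time."""
    n_features = len(X[0])
    scores = [anova_f([row[c] for row in X], y) for c in range(n_features)]
    ranked = sorted(range(n_features), key=lambda c: scores[c], reverse=True)
    best_subset, best_acc = [], -1.0
    for size in range(1, n_features + 1):
        acc = loo_accuracy(X, y, ranked[:size])
        if acc > best_acc:  # keep the smallest subset reaching the best accuracy
            best_subset, best_acc = ranked[:size], acc
    return ranked, best_subset, best_acc


# Toy data (made up): only the middle feature separates the two classes.
X = [[0.5, 1.0, 9.0], [0.4, 1.2, 1.0], [0.6, 0.9, 5.0],
     [0.5, 3.0, 2.0], [0.4, 3.1, 8.0], [0.6, 2.9, 4.0]]
y = [0, 0, 0, 1, 1, 1]
ranked, best_subset, best_acc = incremental_feature_selection(X, y)
print(ranked, best_subset, best_acc)
```

Using a strict `>` comparison means ties are resolved in favor of the smaller subset, which is one common way to read "best feature subset"; other tie-breaking rules are equally defensible.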

  • Research Article
  • Citations: 11
  • 10.1038/s41598-024-67433-8
PLMACPred prediction of anticancer peptides based on protein language model and wavelet denoising transformation
  • Jul 23, 2024
  • Scientific Reports
  • Muhammad Arif + 3 more

Anticancer peptides (ACPs) play a promising role in the discovery of anticancer drugs. Research on ACPs as therapeutic agents is growing because of their minimal side effects. However, identifying novel ACPs using wet-lab experiments is generally time-consuming, labor-intensive, and expensive. Leveraging computational methods for fast and accurate prediction of ACPs would accelerate the drug discovery process. Herein, a machine learning-based predictor, called PLMACPred, is developed for identifying ACPs from the peptide sequence only. PLMACPred adopted a set of encoding schemes representing evolutionary properties, composition properties, and protein language model (PLM) embeddings, i.e., evolutionary scale modeling (ESM-2)- and ProtT5-based embeddings, to encode peptides. Then, two-dimensional (2D) wavelet denoising (WD) was employed to remove noise from the extracted features. Finally, an ensemble-based cascade deep forest (CDF) model was developed to identify ACPs. The PLMACPred model attained superior performance on all three benchmark datasets, namely ACPmain, ACPAlter, and ACP740, over tenfold cross-validation and on an independent dataset. PLMACPred outperformed the existing models and improved the prediction accuracy by 18.53%, 2.4%, and 7.59% on the ACPmain, ACPAlter, and ACP740 datasets, respectively. We showed that embeddings from ProtT5 and ESM-2 captured better contextual information from the entire sequence than the other encoding schemes for ACP prediction. For the explainability of the proposed model, the SHAP (SHapley Additive exPlanations) method was used to analyze the effect of features on ACP prediction. A list of novel sequence motifs was derived from the ACP sequences using the MEME suite. We believe PLMACPred will help accelerate the discovery of novel ACPs as well as other activities of microbial peptides.

  • Research Article
  • Citations: 58
  • 10.1016/j.chemolab.2018.11.012
UbiSitePred: A novel method for improving the accuracy of ubiquitination sites prediction by using LASSO to select the optimal Chou's pseudo components
  • Nov 19, 2018
  • Chemometrics and Intelligent Laboratory Systems
  • Xiaowen Cui + 5 more

  • Research Article
  • Citations: 148
  • 10.1186/1471-2105-9-101
Prediction of mucin-type O-glycosylation sites in mammalian proteins using the composition of k-spaced amino acid pairs
  • Feb 18, 2008
  • BMC Bioinformatics
  • Yong-Zi Chen + 3 more

Background: As one of the most common protein post-translational modifications, glycosylation is involved in a variety of important biological processes. Computational identification of glycosylation sites in protein sequences becomes increasingly important in the post-genomic era. A new encoding scheme was employed to improve the prediction of mucin-type O-glycosylation sites in mammalian proteins.

Results: A new protein bioinformatics tool, CKSAAP_OGlySite, was developed to predict mucin-type O-glycosylation serine/threonine (S/T) sites in mammalian proteins. Using the composition of k-spaced amino acid pairs (CKSAAP) based encoding scheme, the proposed method was trained and tested in a new and stringent O-glycosylation dataset with the assistance of Support Vector Machine (SVM). When the ratio of O-glycosylation to non-glycosylation sites in training datasets was set as 1:1, 10-fold cross-validation tests showed that the proposed method yielded a high accuracy of 83.1% and 81.4% in predicting O-glycosylated S and T sites, respectively. Based on the same datasets, CKSAAP_OGlySite resulted in a higher accuracy than the conventional binary encoding based method (about +5.0%). When trained and tested in 1:5 datasets, the CKSAAP encoding showed a more significant improvement than the binary encoding. We also merged the training datasets of S and T sites and integrated the prediction of S and T sites into one single predictor (i.e. S+T predictor). Either in 1:1 or 1:5 datasets, the performance of this S+T predictor was always slightly better than those predictors where S and T sites were independently predicted, suggesting that the molecular recognition of O-glycosylated S/T sites seems to be similar and the increase of the S+T predictor's accuracy may be a result of expanded training datasets. Moreover, CKSAAP_OGlySite was also shown to have better performance when benchmarked against two existing predictors.

Conclusion: Because of CKSAAP encoding's ability of reflecting characteristics of the sequences surrounding mucin-type O-glycosylation sites, CKSAAP_OGlySite has been proved more powerful than the conventional binary encoding based method. This suggests that it can be used as a competitive mucin-type O-glycosylation site predictor to the biological community. CKSAAP_OGlySite is now available at .

  • Research Article
  • Citations: 38
  • 10.1038/s41598-019-52552-4
Characterization and Identification of Lysine Succinylation Sites based on Deep Learning Method
  • Nov 7, 2019
  • Scientific Reports
  • Kai-Yao Huang + 2 more

Succinylation is a type of protein post-translational modification (PTM), which can play important roles in a variety of cellular processes. Due to an increasing number of site-specific succinylated peptides obtained from high-throughput mass spectrometry (MS), various tools have been developed for computationally identifying succinylated sites on proteins. However, most of these tools predict succinylation sites based on traditional machine learning methods. Hence, this work aimed to carry out the succinylation site prediction based on a deep learning model. The abundance of MS-verified succinylated peptides enabled the investigation of substrate site specificity of succinylation sites through sequence-based attributes, such as position-specific amino acid composition, the composition of k-spaced amino acid pairs (CKSAAP), and position-specific scoring matrix (PSSM). Additionally, the maximal dependence decomposition (MDD) was adopted to detect the substrate signatures of lysine succinylation sites by dividing all succinylated sequences into several groups with conserved substrate motifs. According to the results of ten-fold cross-validation, the deep learning model trained using PSSM and informative CKSAAP attributes can reach the best predictive performance and also perform better than traditional machine-learning methods. Moreover, an independent testing dataset that truly did not exist in the training dataset was used to compare the proposed method with six existing prediction tools. The testing dataset comprised of 218 positive and 2621 negative instances, and the proposed model could yield a promising performance with 84.40% sensitivity, 86.99% specificity, 86.79% accuracy, and an MCC value of 0.489. Finally, the proposed method has been implemented as a web-based prediction tool (CNN-SuccSite), which is now freely accessible at http://csb.cse.yzu.edu.tw/CNN-SuccSite/.
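The independent-test figures quoted above can be cross-checked from a confusion matrix. The sketch below is illustrative: the counts (184 true positives, 2280 true negatives) are not reported in the abstract and are back-calculated here by rounding the stated sensitivity and specificity against the 218 positive and 2621 negative instances, so they are an assumption; the function name is likewise hypothetical.

```python
import math


def confusion_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, mcc


# Assumed counts: 218 positives -> 184 TP + 34 FN; 2621 negatives -> 2280 TN + 341 FP.
sens, spec, acc, mcc = confusion_metrics(tp=184, fn=34, tn=2280, fp=341)
print(f"sensitivity={sens:.4f} specificity={spec:.4f} accuracy={acc:.4f} mcc={mcc:.3f}")
```

With these assumed counts the sensitivity, specificity, and accuracy agree with the reported values, and the MCC comes out at roughly 0.49, consistent with the reported 0.489. The gap between an accuracy near 87% and an MCC near 0.49 reflects the class imbalance: with about twelve negatives per positive, false positives outnumber true positives even at high specificity.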

  • Research Article
  • Citations: 270
  • 10.1016/j.jtbi.2013.08.037
Predicting anticancer peptides with Chou′s pseudo amino acid composition and investigating their mutagenicity via Ames test
  • Sep 10, 2013
  • Journal of Theoretical Biology
  • Zohre Hajisharifi + 4 more

  • Research Article
  • Citations: 57
  • 10.1016/j.chemolab.2021.104458
StackACPred: Prediction of anticancer peptides by integrating optimized multiple feature descriptors with stacked ensemble approach
  • Nov 17, 2021
  • Chemometrics and Intelligent Laboratory Systems
  • Muhammad Arif + 6 more

  • Research Article
  • Citations: 90
  • 10.1093/protein/gzp055
Prediction of palmitoylation sites using the composition of k-spaced amino acid pairs
  • Sep 25, 2009
  • Protein Engineering Design and Selection
  • X.-B Wang + 3 more

Palmitoylation is an important hydrophobic protein modification that participates in many cellular processes, including signaling, neuronal transmission, and membrane trafficking. Identifying palmitoylated proteins and the corresponding sites is therefore an important problem. Compared with expensive and time-consuming biochemical experiments, computational methods have attracted much attention due to their good performance in predicting palmitoylation sites. In this paper, we develop a novel automated computational method to perform this work. For a sequence segment in a given protein, an encoding scheme based on the composition of k-spaced amino acid pairs (CKSAAP) is introduced, and a support vector machine is then used as the predictor. The proposed prediction model, CKSAAP-Palm, outperforms the existing method CSS-Palm2.0 in both cross-validation experiments and on some independent testing data sets. These results imply that CKSAAP-Palm is able to predict more potential palmitoylation sites and can increase research productivity in palmitoylation site discovery. The corresponding software can be freely downloaded from http://www.aporc.org/doc/wiki/CKSAAP-Palm.

  • Research Article
  • 10.56536/jicet.v5i1.192
Predicting Anticancer Peptides Using Chou's Pseudo Amino Acid Composition Based Features
  • May 8, 2025
  • Journal of Innovative Computing and Emerging Technologies
  • Mohsin Sami + 2 more

The chain of effective cancer treatments has prompted researchers to explore non-traditional methods such as anti-cancer peptides. These therapists are recently offering potential over traditional therapies and opting for the precision of their products that can best achieve their goal. In our study, we employ support vector machines (SVMs) to construct two predictive models for anti-cancer peptides. SVMs are powerful algorithms for automatic analysis and are able to process complex data and make precise predictions. To test SVMs, we understand various aspects of the interactions between proteins and anti-cancer peptides. Our models have prompted us to advance research and promote the development of anti-cancer peptides. We utilize Chou's pseudo amino acid composition (PseAAC)-based features to enhance the predictive ability of our models. By incorporating these features derived from the peptide sequences, the features that contribute to their anti-cancer activity are captured. Using Chou's PseAAC allowed us to consider the complex structural and financial analyses of peptides to obtain more accurate predictions. Our contribution represents a new and integral method for predicting anticancer peptides, which can unlock the potential of computational engineering in drug design and development.

  • Research Article
  • Citations: 6
  • 10.31083/j.fbl2703084
Comprehensive Prediction of Lipocalin Proteins Using Artificial Intelligence Strategy.
  • Mar 5, 2022
  • Frontiers in Bioscience-Landmark
  • Hasan Zulfiqar + 6 more

Lipocalins belong to the calycin family, and their sequence length is generally between 165 and 200 residues. They are mainly stable and multifunctional extracellular proteins. Lipocalins play an important role in several stress responses and allergic inflammations. Because the accurate identification of lipocalins could provide significant evidence for the study of their function, it is necessary to develop a machine learning-based model to recognize lipocalins. In this study, we constructed a prediction model to identify lipocalins. Their sequences were encoded by six types of features, namely amino acid composition (AAC), composition of k-spaced amino acid pairs (CKSAAP), pseudo amino acid composition (PseAAC), Geary correlation (GD), normalized Moreau-Broto autocorrelation (NMBroto) and composition/transition/distribution (CTD). Subsequently, these features were optimized by using feature selection techniques. A classifier based on random forest (RF) was trained on the optimal features. The results of 10-fold cross-validation showed that our computational model classified lipocalins with an accuracy of 95.03% and an area under the curve of 0.987. On the independent dataset, our computational model achieved an accuracy of 89.90%, which was 4.17% higher than the existing model. In this work, we developed an advanced computational model to discriminate lipocalin proteins from non-lipocalin proteins. In the proposed model, protein sequences were encoded by six descriptors. Then, feature selection was performed to pick out the features that produced the maximum accuracy. On the basis of the best feature subset, the RF-based classifier obtained the best prediction results.

  • Research Article
  • Citations: 3
  • 10.1016/j.heliyon.2024.e30373
Efficient screening of pharmacological broad-spectrum anti-cancer peptides utilizing advanced bidirectional Encoder representation from Transformers strategy
  • May 1, 2024
  • Heliyon
  • Yupeng Niu + 9 more

  • Research Article
  • Citations: 129
  • 10.1093/bib/bbab008
Anticancer peptides prediction with deep representation learning features.
  • Feb 3, 2021
  • Briefings in bioinformatics
  • Zhibin Lv + 4 more

Anticancer peptides constitute one of the most promising therapeutic agents for combating common human cancers. Using wet experiments to verify whether a peptide displays anticancer characteristics is time-consuming and costly. Hence, in this study, we proposed a computational method named identify anticancer peptides via deep representation learning features (iACP-DRLF), which uses the light gradient boosting machine algorithm and deep representation learning features. Two kinds of sequence embedding technologies were used, namely soft symmetric alignment embedding and unified representation (UniRep) embedding, both of which involved deep neural network models based on long short-term memory networks and their derived networks. The results showed that the use of deep representation learning features greatly improved the capability of the models to discriminate anticancer peptides from other peptides. Also, UMAP (uniform manifold approximation and projection for dimension reduction) and SHAP (SHapley Additive exPlanations) analyses indicated that UniRep has an advantage over other features for anticancer peptide identification. The Python script and pretrained models can be downloaded from https://github.com/zhibinlv/iACP-DRLF or from http://public.aibiochem.net/iACP-DRLF/.

  • Research Article
  • Citations: 24
  • 10.1186/s12859-018-2394-9
Characterization and identification of lysine glutarylation based on intrinsic interdependence between positions in the substrate sites
  • Feb 1, 2019
  • BMC Bioinformatics
  • Kai-Yao Huang + 4 more

Background: Glutarylation, the addition of a glutaryl group (five carbons) to a lysine residue of a protein molecule, is an important post-translational modification and plays a regulatory role in a variety of physiological and biological processes. As the number of experimentally identified glutarylated peptides increases, it becomes imperative to investigate substrate motifs to enhance the study of protein glutarylation. We carried out a bioinformatics investigation of glutarylation sites based on amino acid composition using a public database containing information on 430 non-homologous glutarylation sites.

Results: The TwoSampleLogo analysis indicates that positively charged and polar amino acids surrounding glutarylated sites may be associated with the specificity in substrate site of protein glutarylation. Additionally, the chi-squared test was utilized to explore the intrinsic interdependence between two positions around glutarylation sites. Further, maximal dependence decomposition (MDD), which consists of partitioning a large-scale dataset into subgroups with statistically significant amino acid conservation, was used to capture motif signatures of glutarylation sites. We considered single features, such as amino acid composition (AAC), amino acid pair composition (AAPC), and composition of k-spaced amino acid pairs (CKSAAP), as well as the effectiveness of incorporating MDD-identified substrate motifs into an integrated prediction model. Evaluation by five-fold cross-validation showed that AAC was most effective in discriminating between glutarylation and non-glutarylation sites, according to support vector machine (SVM).

Conclusions: The SVM model integrating MDD-identified substrate motifs performed well, with a sensitivity of 0.677, a specificity of 0.619, an accuracy of 0.638, and a Matthews Correlation Coefficient (MCC) value of 0.28. Using an independent testing dataset (46 glutarylated and 92 non-glutarylated sites) obtained from the literature, we demonstrated that the integrated SVM model could improve the predictive performance effectively, yielding a balanced sensitivity and specificity of 0.652 and 0.739, respectively. This integrated SVM model has been implemented as a web-based system (MDDGlutar), which is now freely available at http://csb.cse.yzu.edu.tw/MDDGlutar/.

  • Research Article
  • Citations: 34
  • 10.1021/acs.jproteome.0c00314
IGlu_AdaBoost: Identification of Lysine Glutarylation Using the AdaBoost Classifier.
  • Oct 22, 2020
  • Journal of Proteome Research
  • Lijun Dou + 4 more

Lysine glutarylation is a newly reported post-translational modification (PTM) that plays significant roles in regulating metabolic and mitochondrial processes. Accurate identification of protein glutarylation is the primary task to better investigate molecular functions and various applications. Because traditional biological sequencing techniques are time-consuming and expensive, and because protein data are growing explosively, building precise computational models to rapidly identify glutarylation is a popular and feasible solution. In this work, we proposed a novel AdaBoost-based predictor called iGlu_AdaBoost to distinguish glutarylation and non-glutarylation sequences. Here, the top 37 features were chosen from a total of 1768 combined features using Chi2 followed by incremental feature selection (IFS) to build the model, including 188D, the composition of k-spaced amino acid pairs (CKSAAP), and enhanced amino acid composition (EAAC). With the help of the hybrid-sampling method SMOTE-Tomek, the AdaBoost algorithm achieved satisfactory recall, specificity, and AUC values of 87.48%, 72.49%, and 0.89 over 10-fold cross-validation as well as 72.73%, 71.92%, and 0.63 on the independent test, respectively. Further feature analysis inferred that the positively charged amino acids R and K play critical roles in glutarylation recognition. Our model showed good generalization ability and consistent prediction results for positive and negative samples, comparable to four published tools. The proposed predictor is an efficient tool to find potential glutarylation sites and provides helpful suggestions for further research on glutarylation mechanisms and related disease treatments.
