In-silico computational methods for identifying protein-protein interactions (PPIs) between target and host proteins are crucial for developing effective infection treatments. These methods make it possible to determine high-quality, accurate PPIs quickly, to predict which protein pairs in a large pool are most likely to interact physically, and either to reduce the need for experimental confirmation or to prioritize pairs for experiments. This study proposes using gene ontology and natural language processing (NLP) approaches to extract and quantify features from protein sequences. In the first step, proteins were represented using gene ontology terms, and a set of features was generated. In the second step, NLP techniques treated gene ontology terms as a word dictionary and created numerical vectors using the bag of words (BoW), count vector, term frequency-inverse document frequency (TF-IDF), and information content methods. In the third step, several machine learning methods, including Decision Tree, Random Forest, Bagging-RepTree, Bagging-RF, BayesNet, Deep Neural Network (DNN), Logistic Regression, Support Vector Machine (SVM), and VotedPerceptron, were employed to predict protein interactions in the datasets. In the fourth step, the Max-Min Parents and Children (MMPC) feature selection algorithm was applied to improve predictions using fewer features. The proposed method was tested on the SARS-CoV-2 protein interaction dataset. The MMPC algorithm reduced the feature count by over 99% while improving protein interaction prediction. After feature selection, the DNN method achieved the highest predictive performance, with an AUC of 0.878 and an F-Measure of 0.793. The sequence-based protein encoding methods AAC, APAAC, CKSAAPP, CTriad, DC, and PAAC were also applied to the proteins in the SARS-CoV-2 interaction dataset, and their performance was compared with that of GO-NLP. The methods were evaluated both separately and in combination. The highest performance was obtained from the combined dataset, with an AUC of 0.888. This study demonstrates that the proposed gene ontology and NLP approach, using the MMPC-DNN model, can successfully predict protein-protein interactions for antiviral drug design with significantly fewer features.
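To make the second step concrete, the following is a minimal sketch of how gene ontology terms might be treated as a word dictionary and turned into TF-IDF vectors. It is an illustration under stated assumptions, not the study's implementation: the function names `tfidf_vectors` and `pair_features`, the smoothed IDF form `log(N / df) + 1`, the concatenation of two protein vectors into one pair vector, and the toy annotations are all hypothetical choices made for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn per-protein GO term lists into TF-IDF vectors.

    docs: list of lists of GO term IDs, one list per protein.
    Returns (vocab, vectors), where vocab is the sorted term dictionary.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of proteins annotated with each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed IDF variant: log(N / df) + 1, so shared terms keep some weight.
    idf = {term: math.log(n_docs / df[term]) + 1.0 for term in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for unannotated proteins
        vectors.append([counts[term] / total * idf[term] for term in vocab])
    return vocab, vectors

def pair_features(vec_a, vec_b):
    """Assumed pair encoding: concatenate the two protein vectors."""
    return vec_a + vec_b

# Toy annotations (illustrative only, not taken from the study's dataset).
toy_docs = [
    ["GO:0005515", "GO:0016032"],  # hypothetical viral protein
    ["GO:0005515", "GO:0005737"],  # hypothetical host protein 1
    ["GO:0005737"],                # hypothetical host protein 2
]
vocab, vectors = tfidf_vectors(toy_docs)
pair = pair_features(vectors[0], vectors[1])
```

In this sketch a term that annotates fewer proteins receives a larger weight, and each candidate protein pair becomes a single vector that a classifier could take as input and that a feature selection step could then prune.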