Abstract
Background: Identifying disease-protein associations is a key step in treating disease, understanding pathomechanisms, and developing drugs. Although experimental methods can be used to identify disease-protein associations, they are often time-consuming, laborious, and expensive. Therefore, there is a strong need to develop computational methods to identify potential disease-protein associations.
Objective: This work aimed to study the effect of a graph embedding algorithm and a reliable negative sample screening method on predicting disease-protein associations.
Methods: Information on disease similarity, disease-protein association, and protein-protein interaction was used to construct a heterogeneous network comprising a protein-protein interaction subnetwork, a disease similarity subnetwork, and a disease-protein association subnetwork. A graph embedding algorithm was then used to obtain network node features that characterize the disease-protein relationships. The support vector data description algorithm was applied to screen reliable negative samples. Finally, a random forest algorithm was employed to construct a model for identifying potential disease-protein associations.
Results: On a constructed benchmark dataset with a 10-fold cross-validation test, the present method achieved an accuracy of 94.55%, a specificity of 98.49%, a precision of 98.36%, a Matthews correlation coefficient of 0.8938, an area under the receiver operating characteristic curve of 0.9815, and an area under the precision-recall curve of 0.9591. Results from a series of nonredundant datasets and an independent test dataset showed that our method is robust to data redundancy and can accurately identify disease-related proteins, protein-related diseases, and potential disease-protein associations. Based on the constructed model, a large-scale prediction study identified more than 1.7 million potential disease-protein association pairs with a predicted probability greater than 99%. The top five predicted disease-protein association pairs were further confirmed by the literature and by molecular docking simulations.
Conclusion: Extensive experimental results showed that the proposed method can effectively identify potential disease-protein associations. The method is expected to help not only in understanding disease mechanisms at the protein level, but also in discovering new protein targets and potential small-molecule drugs.
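For readers less familiar with the threshold-based metrics quoted in the Results (accuracy, specificity, precision, and the Matthews correlation coefficient), the following minimal sketch shows how each is derived from the four counts of a binary confusion matrix. The function name `confusion_metrics` and the example counts are illustrative assumptions; they are not taken from the paper and do not reproduce its figures.

```python
import math


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute threshold-based metrics from binary confusion-matrix counts.

    tp, tn, fp, fn are the numbers of true positives, true negatives,
    false positives, and false negatives.
    """
    total = tp + tn + fp + fn
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": _ratio(tp + tn, total),
        "specificity": _ratio(tn, tn + fp),   # true-negative rate
        "precision": _ratio(tp, tp + fp),     # positive predictive value
        "recall": _ratio(tp, tp + fn),        # sensitivity
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    metrics = confusion_metrics(tp=90, tn=95, fp=5, fn=10)
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

The two area-under-curve values in the Results cannot be obtained this way: they depend on the ranking of predicted probabilities across all decision thresholds, not on a single confusion matrix.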