Abstract
The identification of compound-protein interactions plays key roles in the drug development toward discovery of new drug leads and new therapeutic protein targets. There is therefore a strong incentive to develop new efficient methods for predicting compound-protein interactions on a genome-wide scale. In this paper we develop a novel chemogenomic method to make a scalable prediction of compound-protein interactions from heterogeneous biological data using minwise hashing. The proposed method mainly consists of two steps: 1) construction of new compact fingerprints for compound-protein pairs by an improved minwise hashing algorithm, and 2) application of a sparsity-induced classifier to the compact fingerprints. We test the proposed method on its ability to make a large-scale prediction of compound-protein interactions from compound substructure fingerprints and protein domain fingerprints, and show superior performance of the proposed method compared with the previous chemogenomic methods in terms of prediction accuracy, computational efficiency, and interpretability of the predictive model. All the previously developed methods are not computationally feasible for the full dataset consisting of about 200 millions of compound-protein pairs. The proposed method is expected to be useful for virtual screening of a huge number of compounds against many protein targets.
Highlights
The identification of compound-protein interactions is an important part in the drug development toward discovery of new drug leads and new therapeutic protein targets
The state-of-the-art in the chemogenomic approach is to built the chemogenomic space of compound-protein pairs as the tensor product of the chemical space of compounds and the genomic space of proteins, and analyze compound-protein pairs by machine learning classifiers such as support vector machine (SVM) [4,5,6,7,8]
In this paper we develop a novel chemogenomic method to make a scalable prediction of compoundprotein interactions from heterogeneous biological data using minwise hashing, which is applicable for virtual screening of a huge number of compounds against many human proteins
Summary
The identification of compound-protein interactions is an important part in the drug development toward discovery of new drug leads and new therapeutic protein targets. The input of the SVM method in most previous works is the pairwise kernel similarity matrix of compound-protein pairs, which makes it difficult to analyze large-scale data. In this study we consider representing compound-protein pairs by the fingerprints to use them as inputs of linear SVM, because the linear SVM provides us with interpretable predictive models and works well for superhigh dimensional data [16]. In this paper we develop a novel chemogenomic method to make a scalable prediction of compoundprotein interactions from heterogeneous biological data using minwise hashing, which is applicable for virtual screening of a huge number of compounds against many human proteins. The proposed method mainly consists of two steps: 1) construction of new compact fingerprints for compound-protein pairs by an improved minwise hashing algorithm, and 2) application of the linear SVM to the compact fingerprints. Each protein was represented by a domain fingerprint (binary vector) whose elements encode for the presence or absence of each of the retained 4,137 PFAM domains by 1 or 0, respectively
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.