Abstract
Reconstructing host-pathogen protein interaction networks is of great significance for revealing the underlying microbial pathogenesis. However, current experimentally derived networks are generally small and should be augmented by computational methods for less biased biological inference. From the point of view of computational modelling, data scarcity, data unavailability and negative data sampling are the three major problems in reconstructing host-pathogen protein interaction networks. In this work, we address these three concerns and propose a probability weighted ensemble transfer learning model for HIV-human protein interaction prediction (PWEN-TLM), in which a support vector machine (SVM) is adopted as the individual classifier of the ensemble. In the model, data scarcity and data unavailability are tackled by homolog knowledge transfer. The importance of homolog knowledge is measured by the ROC-AUC metric of the individual classifiers, whose outputs are probability weighted to yield the final decision. In addition, we validate the assumption that homolog knowledge alone is sufficient to train a satisfactory model for host-pathogen protein interaction prediction. The model is thus more robust to data unavailability and places less demanding constraints on the data. As regards negative data construction, experiments show that building negative examples by excluding subcellularly co-localized proteins is unbiased and more reliable than random sampling. Lastly, we analyse the predictions that overlap between our model and existing models, and apply the model to the recognition of novel host-pathogen PPIs for further biological research.
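The probability weighting described in the abstract can be illustrated with a minimal sketch. It assumes, and this is only an assumption because the abstract does not give the formula, that each individual classifier's weight is its validation ROC-AUC normalized so that the weights sum to one. The classifier names `target` and `homolog` are hypothetical labels, and the probability outputs are taken as given (for example from SVMs with probability outputs); SVM training itself is omitted.

```python
from typing import Dict, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC computed as the fraction of (positive, negative) pairs
    in which the positive example receives the higher score (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_weights(labels: Sequence[int],
                val_probs: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Assumed weighting scheme: each classifier's validation ROC-AUC,
    normalized so that all weights sum to 1."""
    aucs = {name: roc_auc(labels, probs) for name, probs in val_probs.items()}
    total = sum(aucs.values())
    return {name: auc / total for name, auc in aucs.items()}


def weighted_probability(weights: Dict[str, float],
                         probs: Dict[str, float]) -> float:
    """Combine the per-classifier interaction probabilities for one protein pair."""
    return sum(weights[name] * probs[name] for name in weights)
```

With validation labels `[1, 1, 0, 0]`, a classifier that ranks every positive above every negative gets a larger weight than one that misorders a pair, so its probability output contributes more to the final decision.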
Highlights
Accurate mapping of the protein interactome is essential to reveal protein functions, biological processes and signal transduction pathways
The work [14] explained why the gene ontology (GO) feature outperformed other feature information, based on two observations: (1) proteins localized in identical cellular compartments are more likely to interact than proteins that reside in spatially distant compartments; (2) proteins that participate in similar biological processes or perform similar molecular functions are likely to interact
Data scarcity, data unavailability and negative data sampling are the three major concerns to be addressed for the computational reconstruction of HIV-human protein-protein interaction (PPI) networks
Summary
Accurate mapping of the protein interactome is essential to reveal protein functions, biological processes and signal transduction pathways. Wuchty [10] combined sequence k-mers, interologs, gene ontology and signal transduction pathways to predict and validate the protein interactions between Plasmodium falciparum and Homo sapiens. In the latter two models, the validation information (gene co-expression, signal transduction pathways, gene ontology) was used to manually filter the predicted PPIs. It has been claimed that gene ontology (GO) is one of the strongest indicators, among the available types of feature information, for host-pathogen PPI prediction [6] and intra-species PPI prediction [3,4,11,12,13,14,15,16,17]. The three aspects of gene ontology (cellular compartments, biological processes and molecular functions) are all informative indicators of PPIs.
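The cellular-compartment observation also underlies the negative data construction mentioned in the abstract. The following is a minimal sketch of one possible reading of that idea: a pathogen-host pair is a candidate negative only if both proteins have known subcellular annotations, the annotations share no compartment, and the pair is not a known interaction. The function name, the dictionary layout and the protein identifiers are hypothetical; the paper's actual procedure is not specified in this section.

```python
import random
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (pathogen protein, host protein)


def sample_negatives(pathogen_loc: Dict[str, Set[str]],
                     host_loc: Dict[str, Set[str]],
                     known_positives: Set[Pair],
                     n: int,
                     seed: int = 0) -> List[Pair]:
    """Sample up to n negative pairs from pathogen-host pairs whose
    subcellular annotations do not overlap (assumed reading of the
    'exclusion of co-localized proteins' idea)."""
    candidates = sorted(
        (p, h)
        for p, p_loc in pathogen_loc.items()
        for h, h_loc in host_loc.items()
        if p_loc and h_loc                 # skip proteins with no annotation
        and not (p_loc & h_loc)            # no shared compartment
        and (p, h) not in known_positives  # never label a known PPI negative
    )
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(candidates, min(n, len(candidates)))
```

Proteins without any localization annotation are skipped in this sketch rather than treated as non-co-localized, since missing annotation is not evidence of spatial separation.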