The interactions of proteins to form complexes play a crucial role in cell function. Data on protein-protein or pairwise interactions (PPI) typically come from a combination of sample separation and mass spectrometry. Since 2010, several extensive, high-throughput mass spectrometry-based experimental studies have dramatically expanded public repositories for PPI data and, by extension, our knowledge of protein complexes. Unfortunately, limited overlap between experiments, modality-oriented biases, and the prohibitive cost of experimental reproducibility continue to limit coverage of the human protein assembly map, both underscoring the need for and spurring the development of relevant computational approaches. Here, we present a new method for predicting the strength of protein interactions. It addresses two important issues that have limited past PPI prediction approaches: incomplete feature sets and incomplete proteome coverage. For a given collection of protein pairs, we fused data from heterogeneous sources into a feature matrix and identified the minimal set of feature partitions for which a non-empty set of protein pairs had complete values. For each such feature partition, we trained a classifier to predict PPI probabilities. We then calculated an overall prediction for a given protein pair by weighting the probabilities from all models that applied to that pair. Our approach accurately identified known and highly probable PPI, far exceeding the performance of current approaches and providing more complete proteome coverage. We then used the predicted probabilities to assemble complexes using previously described graph-based tools and clustering algorithms and again obtained improved results. Lastly, we used features for three human cell lines to predict PPI and complex scores and identified complexes predicted to differ between those cell lines.
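The partition-and-combine idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. Several choices here are assumptions made only for illustration: feature partitions are taken to be the distinct patterns of non-missing features observed in the data, the classifier is a small hand-written logistic regression, per-model probabilities are weighted by training-set size, and all function names and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def observed(row):
    """Indices of features that are present (not None) for one protein pair."""
    return frozenset(i for i, v in enumerate(row) if v is not None)

def fit_logistic(X, y, lr=1.0, epochs=2000):
    """Batch gradient-descent logistic regression (stand-in for any classifier).
    Returns the weights with the bias as the last entry."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def train_partition_models(rows, labels):
    """Train one model per feature subset (assumption: subsets are the
    missingness patterns seen in the data). Each model is fit on every pair
    that has complete values for that subset."""
    models = {}
    for feats in {observed(r) for r in rows}:
        if not feats:
            continue
        cols = sorted(feats)
        idx = [i for i, r in enumerate(rows) if feats <= observed(r)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip subsets that lack both classes
        X = [[rows[i][c] for c in cols] for i in idx]
        models[feats] = (cols, fit_logistic(X, ys), len(idx))
    return models

def predict_pair(row, models):
    """Weighted mean of probabilities from all models applicable to this pair.
    Weighting by training-set size is an illustrative assumption."""
    have = observed(row)
    num = den = 0.0
    for feats, (cols, w, n_train) in models.items():
        if feats <= have:
            z = sum(wj * row[c] for wj, c in zip(w, cols)) + w[-1]
            num += n_train * sigmoid(z)
            den += n_train
    return num / den if den else None

# Toy feature matrix: rows are protein pairs, None marks a missing feature.
rows = [
    [0.9, 0.8, None], [0.8, 0.9, 0.7], [0.1, 0.2, None],
    [0.2, 0.1, 0.3], [None, 0.9, 0.8], [None, 0.1, 0.2],
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = known interaction, 0 = non-interaction

models = train_partition_models(rows, labels)
p_high = predict_pair([0.85, 0.9, None], models)
p_low = predict_pair([0.15, 0.1, None], models)
p_full = predict_pair([0.85, 0.9, 0.8], models)
```

In this sketch, a pair with only some features available is still scored by every model whose feature subset it fully covers, which is how a scheme of this kind can extend coverage to pairs with incomplete data. A pair with no usable features yields `None` rather than a guessed score.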