Abstract

Motivation: Protein-protein interactions (PPIs) are critical to normal cellular function and are implicated in many disease pathways. Many protein functions are mediated and regulated by protein interactions through post-translational modifications (PTMs). However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct. These annotations are produced mainly by manual curation, which is neither time- nor cost-effective. Here we aim to aid human curation by using deep learning, trained on distantly supervised data, to extract PPIs along with their pairwise PTM from the literature.

Method: We use the IntAct PPI database to create a distantly supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models, dubbed PPI-BioBERT-x10, to improve confidence calibration. We extend the ensemble average confidence approach with confidence variation to counteract the effects of class imbalance when extracting high confidence predictions.

Results and conclusion: Evaluated on the test set, the PPI-BioBERT-x10 model achieved a modest micro-averaged F1 of 41.3 (P = 58.1, R = 32.1). However, by combining high confidence with low variation to identify high quality predictions, and so tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts and extracted 1.6 million PTM-PPI predictions (546507 unique PTM-PPI triplets), which we filtered to approximately 5700 (4584 unique) high confidence predictions. Human evaluation of a small randomly sampled subset of the 5700 shows that precision drops to 33.7% despite confidence calibration, which highlights the challenge of generalising beyond the test set. We mitigate the problem by including only predictions associated with multiple papers, which improves the precision to 58.8%.
In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
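
The filtering idea described above, keeping a prediction only when the ensemble is both confident on average and in close agreement, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_high_quality`, the use of the population standard deviation as the measure of variation, and the thresholds `min_mean` and `max_std` are all assumptions made for illustration, since the abstract does not state them.

```python
from statistics import mean, pstdev


def select_high_quality(ensemble_probs, min_mean=0.9, max_std=0.05):
    """Decide whether one ensemble prediction is confident and stable.

    ensemble_probs: one list of class probabilities per ensemble member.
    min_mean, max_std: hypothetical thresholds, not values from the paper.
    Returns (label, mean_confidence, variation, keep).
    """
    n_classes = len(ensemble_probs[0])

    # Ensemble average confidence for each class.
    avg = [mean([member[c] for member in ensemble_probs]) for c in range(n_classes)]

    # Predicted label is the class with the highest average confidence.
    label = max(range(n_classes), key=lambda c: avg[c])

    # Variation: how much the members disagree on the predicted class.
    variation = pstdev([member[label] for member in ensemble_probs])

    keep = avg[label] >= min_mean and variation <= max_std
    return label, avg[label], variation, keep


# Three members that agree closely: kept.
agree = [[0.02, 0.98], [0.03, 0.97], [0.01, 0.99]]
# High average confidence, but one member dissents: filtered out.
disagree = [[0.0, 1.0], [0.0, 1.0], [0.2, 0.8]]

print(select_high_quality(agree))
print(select_high_quality(disagree))
```

In this sketch the second example is rejected even though its average confidence exceeds the threshold, which is the extra effect of adding a variation criterion on top of the ensemble average.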

Highlights

  • Critical biological processes, such as signaling cascades and metabolism, are regulated by protein-protein interactions (PPIs) that modify other proteins in order to modulate their stability or activity via post-translational modifications (PTMs)

  • We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts and extracted 1.6 million PTM-PPI predictions (546507 unique PTM-PPI triplets), which we filtered to approximately 5700 (4584 unique) high confidence predictions

  • We created a distantly supervised training dataset for extracting PTM-PPIs from PubMed abstracts by leveraging the IntAct database, with annotations for phosphorylation, dephosphorylation, methylation, demethylation, ubiquitination, and acetylation

Introduction

Critical biological processes, such as signaling cascades and metabolism, are regulated by protein-protein interactions (PPIs) that modify other proteins in order to modulate their stability or activity via post-translational modifications (PTMs). PPIs are curated in large online repositories such as IntAct [1] and HPRD [2]. Most PPIs are not annotated with a function. For example, we found that the IntAct database has over 100,000 human PPIs, but fewer than 4000 of these are annotated with PTMs such as phosphorylation, acetylation or methylation. Understanding the nature of the PTM between an interacting protein pair is critical for researchers to determine the impact of network perturbations and their downstream biological consequences. PPIs and PTMs in biological databases are usually manually curated, which is time-consuming and requires highly trained curators. The adoption of automated curation methods is essential for the sustainability of this work.
