Abstract

The exponentially increasing size of biomedical literature and the limited ability of manual curators to discover protein–protein interactions (PPIs) in text has led to delays in keeping PPI databases updated with the current findings. The state-of-the-art text mining methods for PPI extraction are primarily based on deep learning (DL) models, and the performance of a DL-based method is mainly affected by the architecture of DL models and the feature embedding methods. In this study, we compared different architectures of DL models, including convolutional neural networks (CNN), long short-term memory (LSTM), and hybrid models, and proposed a hybrid architecture of a bidirectional LSTM+CNN model for PPI extraction. Pretrained word embedding and shortest dependency path (SDP) embedding are fed into a two-embedding channel model, such that the model is able to model long-distance contextual information and can capture the local features and structure information effectively. The experimental results showed that the proposed model is superior to the non-hybrid DL models, and the hybrid CNN+Bidirectional LSTM model works well for PPI extraction. The visualization and comparison of the hidden features learned by different DL models further confirmed the effectiveness of the proposed model.

Highlights

  • Protein–protein interactions (PPIs) play important roles in various biological processes and are of pivotal importance in the regulation of biological systems, and are implicated in the development of disease states [1]

  • convolutional neural networks (CNN) is applied to encode the important information contained in the bidirectional long short-term memory (LSTM) networks, and to extract the local features and structure information effectively

  • CNN is applied to encode the important information contained in the bidirectional LSTM networks, and to capture the local features and structure information effectively

Read more

Summary

Introduction

Protein–protein interactions (PPIs) play important roles in various biological processes and are of pivotal importance in the regulation of biological systems, and are implicated in the development of disease states [1]. Traditional ML-based methods usually collect words around target entities as features, such as unigram, bigram, and some semantic and syntactic features, and these features are put into a bag-of-words model and encoded into one-hot type representations. Such representations are unable to capture semantic relations among words or phrases and fail in generalizing the long context dependency [13]. Recent studies have proposed several feature embedding methods combining DL models for PPI extraction Most of these studies focused on finding effective linguistic features for embedding or on tuning model hyperparameters for a certain framework of DL (e.g., convolutional neural networks (CNN) and long short-term memory (LSTM)).

Related Work
The Model Description
Model Input
Intermediate Structure
Datasets and Preprocessing
Pretrained Embeddings
Implementation and Evaluation Metrics
Performance Comparison
Hidden Features Visualization and Comparison
Findings
Discussion
Conclusions
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.