Extracting chemical-protein relations with ensembles of SVM and deep learning models.

Yifan Peng, Anthony Rios, Zhiyong Lu, Ramakanth Kavuluru

doi:10.1093/database/bay073

Abstract

Mining relations between chemicals and proteins from the biomedical literature is an increasingly important task. The CHEMPROT track at BioCreative VI aims to promote the development and evaluation of systems that can automatically detect chemical–protein relations in running text (PubMed abstracts). This work describes our CHEMPROT track entry, an ensemble of three systems: a support vector machine, a convolutional neural network and a recurrent neural network. Their outputs are combined using majority voting or stacking to produce the final predictions. During the challenge, our CHEMPROT system obtained a precision of 0.7266 and a recall of 0.5735, for an F-score of 0.6410. This was the highest performance in the task during the 2017 challenge, and it demonstrates the effectiveness of machine learning-based approaches for automatic relation extraction from the biomedical literature.

Database URL: http://www.biocreative.org/tasks/biocreative-vi/track-5/
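As an illustration of the majority-voting combination and of how the reported F-score follows from precision and recall, here is a minimal Python sketch. The label names (`REL_A`, `REL_B`, `REL_C`, `NONE`), the toy predictions and the rule of falling back to `NONE` when no label wins a strict majority are assumptions made for this example; they are not taken from the system described here.

```python
from collections import Counter

NEGATIVE = "NONE"  # hypothetical label meaning "no relation"


def majority_vote(predictions, fallback=NEGATIVE):
    """Return the label chosen by a strict majority of models, else the fallback.

    The fallback rule is an assumption of this sketch.
    """
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else fallback


def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy per-pair predictions from the three individual models (hypothetical data).
svm_out = ["REL_A", "REL_B", "NONE", "REL_A"]
cnn_out = ["REL_A", "REL_C", "NONE", "REL_B"]
rnn_out = ["REL_B", "REL_B", "REL_C", "REL_C"]

# Combine the three predictions for each candidate pair.
ensemble = [majority_vote(votes) for votes in zip(svm_out, cnn_out, rnn_out)]
print(ensemble)

# Precision and recall as reported in the abstract.
print(round(f_score(0.7266, 0.5735), 4))
```

The last candidate pair shows the case in which all three models disagree, so the sketch returns the fallback label rather than picking one model arbitrarily.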

Highlights

Recognizing the relations between chemicals and proteins is crucial in various tasks such as precision medicine, drug discovery and basic biomedical research
Our observations indicated that pairs are more difficult to classify in longer sentences and that the recurrent neural network (RNN) model detects distant pairs better than the other individual models
We built each support vector machine (SVM), convolutional neural network (CNN) and RNN model using 80% of the total data in the training and development sets, and built the ensemble system using the remaining 20%
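The 80/20 arrangement in the last highlight can be sketched as follows: the individual models are fit on one portion of the data, and a combiner is then fit on their outputs for the held-out portion. This is only a sketch under stated assumptions. The function names, the fixed random seed, the toy labels and the lookup-table combiner are all hypothetical stand-ins, and the meta-learner actually used for stacking is not specified in this section.

```python
import random
from collections import Counter, defaultdict


def split_80_20(examples, seed=0):
    """Shuffle a copy of the examples and split it 80% / 20%."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def fit_stacker(base_outputs, gold_labels):
    """Fit a lookup-table combiner (a stand-in meta-learner, assumed here).

    For every observed combination of base-model outputs, remember the
    gold label seen most often with that combination.
    """
    table = defaultdict(Counter)
    for outputs, gold in zip(base_outputs, gold_labels):
        table[tuple(outputs)][gold] += 1
    return {combo: counts.most_common(1)[0][0] for combo, counts in table.items()}


def stack_predict(stacker, outputs, fallback="NONE"):
    """Predict with the combiner; unseen combinations get the fallback label."""
    return stacker.get(tuple(outputs), fallback)


# 80% of the examples would train the SVM, CNN and RNN; 20% is held out.
base_part, ensemble_part = split_80_20(range(10))

# Hypothetical (SVM, CNN, RNN) outputs on held-out pairs, with gold labels.
held_out_outputs = [
    ("REL_A", "REL_A", "NONE"),
    ("REL_A", "REL_A", "NONE"),
    ("NONE", "REL_A", "NONE"),
    ("REL_A", "REL_A", "NONE"),
]
held_out_gold = ["REL_A", "REL_A", "NONE", "NONE"]

stacker = fit_stacker(held_out_outputs, held_out_gold)
print(stack_predict(stacker, ("REL_A", "REL_A", "NONE")))
print(stack_predict(stacker, ("NONE", "REL_A", "NONE")))
```

Keeping the two portions disjoint means the combiner learns from base-model outputs on examples those models did not see during training.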

Summary

Introduction

Recognizing the relations between chemicals and proteins is crucial in various tasks such as precision medicine, drug discovery and basic biomedical research. Biomedical researchers study various associations between chemicals and proteins and disseminate their findings in scientific publications. Although manually extracting chemical–protein relations from the biomedical literature is possible, it is costly and time-consuming. Text-mining methods could automatically detect these relations effectively. The BioCreative VI track 5 CHEMPROT task (http://www.biocreative.org/tasks/biocreative-vi/track-5/) aims to promote the development and evaluation of systems that can automatically detect and classify relations between chemical compounds/drugs and proteins [1] in running text (PubMed abstracts). The relations encoded in the text are represented in a standoff-style annotation. The organizers used ‘gene’ and ‘protein’ interchangeably in this task.

Objectives

Methods

Results

Conclusion