Potent pairing: ensemble of long short-term memory networks and support vector machine for chemical-protein relation extraction.

Farrokh Mehryary,Tapio Salakoski,Filip Ginter,Jari Björne

doi:10.1093/database/bay120

Open Access

Journal: Database
Publication Date: Jan 1, 2018
Citations: 14
License type: CC BY 4.0

Affiliation: University of Turku

Abstract

Biomedical researchers regularly discover new interactions between chemical compounds/drugs and genes/proteins, and report them in the research literature. Knowledge of these interactions is crucial in many research areas, such as precision medicine and drug discovery. The BioCreative VI Task 5 (CHEMPROT) challenge promotes the development and evaluation of computer systems that can automatically recognize and extract statements of such interactions from biomedical literature. We participated in this challenge with a Support Vector Machine (SVM) system and a deep learning-based system (ST-ANN), and achieved an F-score of 60.99 for the task. After the shared task, we significantly improved the performance of the ST-ANN system. Additionally, we developed a new deep learning-based system (I-ANN) that considerably outperforms the ST-ANN system. Both the ST-ANN and I-ANN systems are centered around training an ensemble of artificial neural networks and utilizing different bidirectional Long Short-Term Memory (LSTM) chains for representing the shortest dependency path and/or the full sentence. By combining the predictions of the SVM and the I-ANN systems, we achieved an F-score of 63.10 for the task, improving our previous F-score by 2.11 percentage points. Our systems are fully open-source and publicly available. We highlight that the systems presented in this study are applicable not only to the BioCreative VI Task 5, but can also be re-trained to extract any type of relation of interest, with no modification of the source code required, provided that a manually annotated corpus is supplied as training data in a specific file format.
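The abstract refers to representing the shortest dependency path between two entities. As a rough illustration of what that path is, the following minimal sketch finds it with a breadth-first search over an undirected dependency graph. The function name, the toy sentence and its dependency edges are assumptions made for this example and are not taken from the authors' systems, which obtain parses from a real parser and feed the path to LSTM chains.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Return the tokens on the shortest path between two tokens.

    `edges` is a list of (head, dependent) pairs. The graph is treated as
    undirected, and tokens are assumed to be unique strings (a real system
    would use token indices instead).
    """
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    if source not in graph or target not in graph:
        return None

    previous = {source: None}  # back-pointers for path reconstruction
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None  # the two tokens are not connected


# Hypothetical parse of "Aspirin strongly inhibits the COX1 enzyme".
edges = [
    ("inhibits", "Aspirin"),
    ("inhibits", "strongly"),
    ("inhibits", "enzyme"),
    ("enzyme", "the"),
    ("enzyme", "COX1"),
]
path = shortest_dependency_path(edges, "Aspirin", "COX1")
print(path)
```

In this toy case the path skips the modifier "strongly" and the determiner "the", which is the usual motivation for using the shortest dependency path as a compact input for relation classification.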

Highlights

The BioCreative VI Task 5 challenge focuses on the extraction of relations between chemical compounds/drugs and genes/proteins stated in biomedical texts [1].
We first discuss the results of our participation in the shared task, and then focus on the improved results we obtained using the improved ANN (I-ANN) and its combination with the Support Vector Machine (SVM) system.
Unlike the SVM system, our shared task artificial neural network (ST-ANN) and I-ANN systems are based on deep learning and require less feature engineering.

Summary

Introduction

The BioCreative VI Task 5 challenge (hereinafter referred to as the ‘shared task’) focuses on the extraction of relations between chemical compounds/drugs and genes/proteins stated in biomedical texts [1]. Pyysalo et al. [10] have preprocessed and unified five publicly available protein–protein interaction corpora (http://mars.cs.utu.fi/PPICorpora/) in order to facilitate seamless development and comparison of biomedical relation extraction methods. Among these tasks, DDI-2013 [6] has become popular for assessing the performance of relation extraction methods, mainly because it has a relatively large and challenging corpus. In addition to the named entities, the training data for these tasks include manually annotated relations, making these tasks ideal for the development of supervised relation extraction methods. These machine learning-based methods utilize the provided training data to train a classifier (e.g. an SVM, an ANN or a Naive Bayes classifier) capable of detecting statements of relations in texts. Raihani et al. [17] achieved an F-score of 71.14 on the DDI-2013 corpus with a system utilizing lexical, phrase, verb, syntactic and auxiliary features.
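The F-scores quoted above are reported on a 0 to 100 scale. The exact evaluation protocol is not described in this section, so the following is only a minimal sketch, assuming a micro-averaged F-score computed over sets of predicted and gold relation instances. The function name and the toy relation tuples are hypothetical.

```python
def micro_f_score(gold, predicted):
    """Return (precision, recall, F-score) in percent for two collections
    of relation instances, counting exact matches as true positives."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f_score


# Hypothetical instances: (document id, chemical id, protein id, relation type).
gold = [
    ("doc1", "C1", "P1", "inhibitor"),
    ("doc1", "C2", "P1", "agonist"),
    ("doc2", "C1", "P3", "substrate"),
    ("doc2", "C4", "P2", "inhibitor"),
]
predicted = [
    ("doc1", "C1", "P1", "inhibitor"),
    ("doc1", "C2", "P1", "agonist"),
    ("doc2", "C1", "P3", "substrate"),
    ("doc2", "C4", "P2", "agonist"),  # wrong relation type: no match
]
precision, recall, f_score = micro_f_score(gold, predicted)
print(precision, recall, f_score)
```

Because a prediction with the wrong relation type counts as both a false positive and a false negative, it lowers precision and recall at the same time.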