Abstract

Background

Recent studies have proposed deep learning techniques, namely recurrent neural networks, to improve biomedical text mining tasks. However, these techniques rarely take advantage of existing domain-specific resources, such as ontologies. In Life and Health Sciences there is a vast and valuable set of such resources publicly available, which are continuously being updated. Biomedical ontologies are nowadays a mainstream approach to formalize existing knowledge about entities, such as genes, chemicals, phenotypes, and disorders. These resources contain supplementary information that may not yet be encoded in training data, particularly in domains with limited labeled data.

Results

We propose a new model to detect and classify relations in text, BO-LSTM, that takes advantage of domain-specific ontologies by representing each entity as the sequence of its ancestors in the ontology. We implemented BO-LSTM as a recurrent neural network with long short-term memory units, using open biomedical ontologies, specifically Chemical Entities of Biological Interest (ChEBI), the Human Phenotype Ontology, and the Gene Ontology. We assessed the performance of BO-LSTM on drug-drug interactions (DDIs) mentioned in a publicly available corpus from an international challenge, composed of 792 drug descriptions and 233 scientific abstracts. By using the domain-specific ontology in addition to word embeddings and WordNet, BO-LSTM improved the F1-score of both the detection and classification of drug-drug interactions, particularly in a document set with a limited number of annotations. We adapted an existing DDI extraction model with our ontology-based method, obtaining a higher F1-score than the original model. Furthermore, we developed and made available a corpus of 228 abstracts annotated with relations between genes and phenotypes, and demonstrated how BO-LSTM can be applied to other types of relations.

Conclusions

Our findings demonstrate that, even with the high performance of current deep learning techniques, domain-specific ontologies can still be useful to mitigate the lack of labeled data.
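The core mechanism, representing each entity by the chain of its ancestors in an ontology and encoding that chain with an LSTM, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `OntologyEncoder` class, the toy ChEBI-style ancestor chain, and all dimensions are assumptions for demonstration.

```python
# Minimal sketch of the BO-LSTM entity representation (illustrative, not the
# authors' code): an entity is encoded as the sequence of its ontology
# ancestors, embedded and fed through an LSTM.
import torch
import torch.nn as nn

class OntologyEncoder(nn.Module):  # hypothetical name
    def __init__(self, num_terms, embed_dim=50, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(num_terms, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, ancestor_ids):
        # ancestor_ids: (batch, seq_len) term indices ordered root -> entity
        embedded = self.embed(ancestor_ids)   # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)     # final hidden state
        return h_n.squeeze(0)                 # (batch, hidden_dim)

# Toy ChEBI-style ancestor chain with made-up term indices.
term_index = {"chemical entity": 0, "drug": 1, "antibiotic": 2, "tetracycline": 3}
ancestors = torch.tensor([[0, 1, 2, 3]])      # root -> ... -> entity
encoder = OntologyEncoder(num_terms=len(term_index))
print(encoder(ancestors).shape)               # torch.Size([1, 100])
```

In the full model, such an ontology channel would be combined with the word-embedding and WordNet channels mentioned above before the relation is classified.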

Highlights

  • Recent studies have proposed deep learning techniques, namely recurrent neural networks, to improve biomedical text mining tasks

  • We propose a new model, BO-LSTM (Long Short-Term Memory), that can explore domain information from ontologies to improve the task of biomedical relation extraction using deep learning techniques

  • The authors of the Shortest Dependency Paths (SDP)-LSTM model showed that WordNet contributed to an improvement of the F1-score on a relation extraction task

Introduction

Recent studies have proposed deep learning techniques, namely recurrent neural networks, to improve biomedical text mining tasks. These techniques rarely take advantage of existing domain-specific resources, such as ontologies. Biomedical ontologies are nowadays a mainstream approach to formalize existing knowledge about entities, such as genes, chemicals, phenotypes, and disorders. These resources contain supplementary information that may not yet be encoded in training data, particularly in domains with limited labeled data. Deep learning techniques have obtained promising results in various Natural Language Processing (NLP) tasks [4], including relation extraction [5]. These techniques have the advantage of being adaptable to multiple domains, using models pre-trained on unlabeled documents [6]. These models can use unlabeled data to predict the most probable word according to the context words (or vice-versa), leading to meaningful vector representations of the words in a corpus, known as word embeddings.
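As a concrete illustration of this pre-training step, a skip-gram word2vec model can be trained on unlabeled sentences with gensim. The toy corpus and hyperparameters below are assumptions for demonstration, not the corpus or settings used in the paper.

```python
# Minimal word-embedding sketch with gensim's word2vec (skip-gram predicts
# context words from the center word; CBOW does the reverse).
from gensim.models import Word2Vec

corpus = [  # toy unlabeled corpus, purely illustrative
    ["tetracycline", "inhibits", "bacterial", "protein", "synthesis"],
    ["aspirin", "inhibits", "platelet", "aggregation"],
]
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv["inhibits"].shape)  # (100,): the word's embedding vector
```

Each word is thereby mapped to a dense vector whose neighbors in the vector space tend to be semantically related, which is what allows embeddings pre-trained on unlabeled text to transfer to labeled tasks.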

