Abstract

Background: Biomedical literature concerns a wide range of concepts, requiring controlled vocabularies to maintain a consistent terminology across different research groups. However, as new concepts are introduced, biomedical literature is prone to ambiguity, particularly in fields that are advancing rapidly, such as drug design and development. Entity linking is a text mining task that aims at linking entities mentioned in the literature to concepts in a knowledge base. For example, entity linking can help find all documents that mention the same concept and can improve relation extraction methods. Existing approaches focus on the local similarity of each entity and the global coherence of all entities in a document, but do not take into account the semantics of the domain.

Results: We propose a method, PPR-SSM, to link entities found in documents to concepts from domain-specific ontologies. Our method is based on Personalized PageRank (PPR), using the relations of the ontology to generate a graph of candidate concepts for the mentioned entities. We demonstrate how the knowledge encoded in a domain-specific ontology can be used to calculate the coherence of a set of candidate concepts, improving the accuracy of entity linking. Furthermore, we explore weighting the edges between candidate concepts using semantic similarity measures (SSM). We show how PPR-SSM can be used to effectively link named entities to biomedical ontologies, namely chemical compounds, phenotypes, and gene-product localization and processes.

Conclusions: We demonstrated that PPR-SSM outperforms state-of-the-art entity linking methods on four distinct gold standards by taking advantage of the semantic information contained in ontologies. Moreover, PPR-SSM is a graph-based method that does not require training data. Our method improved the entity linking accuracy of chemical compounds by 0.1385 when compared to a method that does not use SSMs.
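The idea of ranking candidate concepts with Personalized PageRank over a weighted candidate graph can be illustrated with a minimal sketch. This is not the authors' implementation: it runs a single PPR pass with a uniform restart distribution over all candidates, and the mention names, candidate identifiers, and edge weights (standing in for semantic similarity values) are hypothetical.

```python
from collections import defaultdict


def personalized_pagerank(edges, personalization, damping=0.85,
                          iterations=100, tol=1e-10):
    """Power-iteration PPR on an undirected, weighted graph.

    edges: dict mapping (u, v) -> positive weight
    personalization: dict mapping node -> restart weight
    """
    neighbors = defaultdict(dict)
    for (u, v), weight in edges.items():
        neighbors[u][v] = weight
        neighbors[v][u] = weight
    nodes = sorted(set(neighbors) | set(personalization))
    total = sum(personalization.values())
    restart = {n: personalization.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1 - damping) * restart[n] for n in nodes}
        for u in nodes:
            out_weight = sum(neighbors[u].values())
            if out_weight == 0:
                # Isolated node: send its mass back to the restart distribution.
                for n in nodes:
                    new_rank[n] += damping * rank[u] * restart[n]
                continue
            for v, weight in neighbors[u].items():
                new_rank[v] += damping * rank[u] * weight / out_weight
        delta = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


def link_entities(candidates, edges):
    """Pick, for each mention, the candidate with the highest PPR score."""
    all_candidates = [c for cands in candidates.values() for c in cands]
    scores = personalized_pagerank(edges, {c: 1.0 for c in all_candidates})
    return {mention: max(cands, key=lambda c: scores[c])
            for mention, cands in candidates.items()}


# Hypothetical candidates for three mentions in one document.
candidates = {
    "mention_1": ["A1", "A2"],
    "mention_2": ["B1"],
    "mention_3": ["C1", "C2"],
}
# Hypothetical edge weights; in a PPR-SSM-like setting these would come
# from ontology relations weighted by a semantic similarity measure.
edges = {
    ("A1", "B1"): 0.9,
    ("A1", "C1"): 0.8,
    ("B1", "C1"): 0.7,
    ("A2", "B1"): 0.1,
    ("C2", "B1"): 0.1,
}

linked = link_entities(candidates, edges)
```

In this toy graph the strongly connected candidates reinforce one another, so the weakly connected alternatives for `mention_1` and `mention_3` receive lower scores. Setting all edge weights to the same value would correspond to a variant that ignores semantic similarity.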

Highlights

  • Biomedical literature concerns a wide range of concepts, requiring controlled vocabularies to maintain a consistent terminology across different research groups

  • As most state-of-the-art Named Entity Recognition (NER) systems are based on machine learning algorithms, they focus on recognizing segments of text that refer to entities of interest, requiring an additional method to match each named entity to a knowledge base (KB)

  • We evaluated our method on three gold standards, consisting of biomedical documents manually annotated with ontology concepts

Summary

Introduction

Lamurias et al. BMC Bioinformatics (2019) 20:534

Biomedical literature concerns a wide range of concepts, requiring controlled vocabularies to maintain a consistent terminology across different research groups. Entity linking matches each entity mention in a document to an entry of a knowledge base (KB) that unequivocally represents that concept [1, 2]. This task is a fundamental component of text mining systems, since it allows the information described in the literature to be integrated across multiple documents [3]. By directly matching a list of concept names and synonyms from a controlled vocabulary to the text, it is possible to directly obtain the respective identifiers. This approach is restricted to the names and synonyms considered in the KB, even when string matching algorithms are used to deal with misspellings and other lexical variations. An alternative is to consider all entities in a document together: the objective of this approach is to select the set of candidate matches that maximizes the global coherence between entities.
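The direct matching of names and synonyms described above can be sketched in a few lines. This is only an illustration under stated assumptions: the vocabulary entries and the `CONCEPT:` identifiers are hypothetical, and the standard-library `difflib.get_close_matches` stands in for whatever string matching algorithm a real system would use.

```python
import difflib

# Hypothetical mini-vocabulary: lower-cased name or synonym -> concept identifier.
VOCABULARY = {
    "acetylsalicylic acid": "CONCEPT:0001",
    "aspirin": "CONCEPT:0001",
    "caffeine": "CONCEPT:0002",
}


def link_mention(mention, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the identifier for a mention, or None if nothing matches.

    Exact (case-insensitive) lookup is tried first; otherwise the closest
    vocabulary entry above the similarity cutoff is used, which tolerates
    misspellings and small lexical variations.
    """
    key = mention.strip().lower()
    if key in vocabulary:
        return vocabulary[key]
    close = difflib.get_close_matches(key, list(vocabulary), n=1, cutoff=cutoff)
    return vocabulary[close[0]] if close else None
```

A mention whose name or synonym is absent from the vocabulary returns `None`, which is the restriction noted above: fuzzy matching recovers spelling variants but cannot link a concept under a name the KB does not list.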

