Abstract

Chemical entities are ubiquitous throughout the biomedical literature, and text-mining systems that can efficiently identify those entities are required. Due to the lack of available corpora and data resources, the community has focused its efforts on developing gene and protein named entity recognition systems; with the release of ChEBI and the availability of an annotated corpus, chemical entity recognition can now be addressed as well. We developed a machine-learning-based method for chemical entity recognition and a lexical-similarity-based method for chemical entity resolution and compared them with Whatizit, a popular dictionary-based method. Our methods outperformed the dictionary-based method in all tasks, yielding an improvement in F-measure of 20% for the entity recognition task, 2–5% for the entity resolution task, and 15% for the combined entity recognition and resolution task.

Highlights

  • Biomedical literature provides extensive information that is not covered in other knowledge resources. The amount of information published in articles and patents is growing at a fast pace, and manual analysis and annotation of the literature is a tedious, time-consuming, and costly process

  • We present an assessment of our machine-learning-based method in comparison with the dictionary-based method, Whatizit. Both methods were applied to the gold standard, but at that time the curators had mapped only 47% of the named entities in this corpus to Chemical Entities of Biological Interest (ChEBI)

  • We decided that enriching the mapping of the annotated entities was necessary to significantly increase the number of chemical named entities mapped to ChEBI

Summary

Introduction

Biomedical literature provides extensive information that is not covered in other knowledge resources. The amount of information published in articles and patents is growing at a fast pace, and manual analysis and annotation of the literature is a tedious, time-consuming, and costly process. Text-mining systems address this problem and have already proved helpful in speeding up some steps of the process [1]. Entity resolution takes as input the strings identified by the preceding entity recognition task and determines exactly which chemical each string corresponds to, by mapping each of them to a reference database entry.
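
The resolution step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the lexical similarity measure used in the paper is not specified in this excerpt, so the sketch substitutes the character-based ratio from Python's standard `difflib` module. The lexicon, its `EX:` identifiers, the function names, and the `threshold` value are all hypothetical placeholders rather than real ChEBI data.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicon mapping placeholder identifiers to known names.
# These are illustrative entries only, not real ChEBI identifiers.
LEXICON = {
    "EX:1": ["water", "dihydrogen oxide"],
    "EX:2": ["ethanol", "ethyl alcohol"],
    "EX:3": ["glucose", "dextrose"],
}


def similarity(a, b):
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(mention, lexicon=LEXICON, threshold=0.8):
    """Map a recognized mention to the most similar lexicon entry.

    Returns (entry_id, score). entry_id is None when no name reaches
    the (assumed) similarity threshold, i.e. the mention stays unmapped.
    """
    best_id, best_score = None, 0.0
    for entry_id, names in lexicon.items():
        for name in names:
            score = similarity(mention, name)
            if score > best_score:
                best_id, best_score = entry_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


if __name__ == "__main__":
    for mention in ["Ethanol", "glucos", "xyz"]:
        print(mention, resolve(mention))
```

Returning no identifier below the threshold reflects the fact that some recognized strings may have no suitable entry in the reference database. A real system would tune both the threshold and the similarity measure against an annotated corpus.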

