Abstract

Automatically identifying chemical and drug names in scientific publications advances information access for this important class of entities in a variety of biomedical disciplines by enabling improved retrieval and linkage to related concepts. While current methods for tagging chemical entities were developed for the article title and abstract, their performance in the full article text is substantially lower. However, the full text frequently contains more detailed chemical information, such as the properties of chemical compounds, their biological effects and interactions with diseases, genes and other chemicals. We therefore present the NLM-Chem corpus, a full-text resource to support the development and evaluation of automated chemical entity taggers. The NLM-Chem corpus consists of 150 full-text articles, doubly annotated by ten expert NLM indexers, with ~5000 unique chemical name annotations, mapped to ~2000 MeSH identifiers. We also describe a substantially improved chemical entity tagger, with automated annotations for all of PubMed and PMC freely accessible through the PubTator web-based interface and API. The NLM-Chem corpus is freely available.

Highlights

  • Chemical names are one of the most searched entity types in PubMed [1], and chemical entities appear throughout the biomedical research literature, in studies from disciplines beyond chemistry such as medicine, biology, and pharmacology.

  • Chemical names can vary through an innumerable mix of typographical variants and alternating uses of hyphens, brackets, spacing, and word order (for example: 5,6-Epoxy-8,11,14-eicosatrienoic acid, 5,6-EET, 5(6)epoxyeicosatrienoic acid, 5(6)-oxido-8,11,14-eicosatrienoic acid, 5(6)-oxidoeicosatrienoic acid, 5,6-epoxyeicosatrienoic acid).

  • We designed this document set to complement other chemical entity recognition corpora, such as CHEMDNER and BC5CDR, and targeted the selection at articles for which human annotation was most valuable.
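
To make the point about name variants concrete, the following is a minimal sketch, not the tagger described in the paper. It assumes a deliberately naive normalization (lowercase, then drop every character that is not a letter or digit) and applies it to the six variants listed in the highlights. The function name `normalize` and the grouping logic are illustrative assumptions.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Naive key (an assumption for illustration): lowercase the name,
    then drop every character that is not a letter or a digit."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# The six variants quoted in the highlights above.
variants = [
    "5,6-Epoxy-8,11,14-eicosatrienoic acid",
    "5,6-EET",
    "5(6)epoxyeicosatrienoic acid",
    "5(6)-oxido-8,11,14-eicosatrienoic acid",
    "5(6)-oxidoeicosatrienoic acid",
    "5,6-epoxyeicosatrienoic acid",
]

# Group the variants by their normalized key.
groups = defaultdict(list)
for variant in variants:
    groups[normalize(variant)].append(variant)

for key, members in groups.items():
    print(key, "<-", members)
```

Under this sketch only "5(6)epoxyeicosatrienoic acid" and "5,6-epoxyeicosatrienoic acid" collapse to the same key; the abbreviation, the "oxido" synonyms, and the forms with explicit double-bond locants remain separate. Stripping punctuation can also conflate names that differ only in locants, so string normalization alone is unlikely to resolve such variants, which is consistent with the case for expert-annotated data mapped to MeSH identifiers.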

Summary

Introduction

Background & Summary

Chemical names are one of the most searched entity types in PubMed [1], and chemical entities appear throughout the biomedical research literature, in studies from disciplines beyond chemistry such as medicine, biology, and pharmacology. For example: dipotassium 2-alkylbenzotriazolyl bis(trifluoroborate)s, 4,7-dibromo-2-octyl-2,1,3-benzotriazole, etc. The abbreviation “MTT” can be mapped to more than 800 different strings in PubMed, including 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide, methyl thiazolyl tetrazolium, mean transit time, malignant triton tumour, and myoblast transfer therapy.
