Abstract

Chemical compounds and drugs are an important class of entities in biomedical research with great potential in a wide range of applications, including clinical medicine. Locating chemical named entities in the literature is a useful step in chemical text mining pipelines for identifying chemical mentions, their properties, and their relationships as discussed in the literature.

We introduce the tmChem system, a chemical named entity recognizer created by combining two independent machine learning models in an ensemble. We use the corpus released as part of the recent CHEMDNER task to develop and evaluate tmChem, achieving a micro-averaged f-measure of 0.8739 on the CEM subtask (mention-level evaluation) and an f-measure of 0.8745 on the CDI subtask (abstract-level evaluation). We also report a high-recall combination (0.9212 for CEM and 0.9224 for CDI). tmChem achieved the highest f-measure reported in the CHEMDNER task for the CEM subtask, and the high-recall variant achieved the highest recall on both the CEM and CDI subtasks.

We report that tmChem is a state-of-the-art tool for chemical named entity recognition and that performance for chemical named entity recognition now matches (or exceeds) the performance previously reported for genes and diseases. Future research should focus on tighter integration between the named entity recognition and normalization steps for improved performance.

The source code and trained models for both models of tmChem are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmChem. The results of running tmChem (Model 2) on PubMed are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator
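For readers unfamiliar with the metric, the following is a minimal sketch of how a micro-averaged f-measure can be computed for a mention-level evaluation such as CEM. It is not the official CHEMDNER scorer or tmChem code: the function name `micro_prf`, the data layout, and the assumption that a predicted mention counts as correct only when its character offsets exactly match a gold mention are all illustrative choices.

```python
from typing import Dict, Set, Tuple

# Hypothetical data layout: a mention is a (start, end) character span,
# and annotations are grouped by abstract identifier.
Mention = Tuple[int, int]
Annotations = Dict[str, Set[Mention]]


def micro_prf(gold: Annotations, predicted: Annotations) -> Tuple[float, float, float]:
    """Return micro-averaged (precision, recall, f-measure).

    Counts are pooled over all abstracts before the ratios are taken,
    which is what distinguishes micro- from macro-averaging.
    Assumes exact-match scoring on mention offsets.
    """
    tp = fp = fn = 0
    for doc_id in set(gold) | set(predicted):
        gold_mentions = gold.get(doc_id, set())
        pred_mentions = predicted.get(doc_id, set())
        tp += len(gold_mentions & pred_mentions)   # correctly predicted spans
        fp += len(pred_mentions - gold_mentions)   # spurious predictions
        fn += len(gold_mentions - pred_mentions)   # missed gold spans

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example: one boundary error in abstract "a1".
gold = {"a1": {(0, 7), (20, 29)}, "a2": {(5, 12)}}
pred = {"a1": {(0, 7), (21, 29)}, "a2": {(5, 12)}}
precision, recall, f_measure = micro_prf(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F={f_measure:.4f}")
```

In the toy example a single boundary error costs both a false positive and a false negative, which illustrates why imprecise entity boundaries weigh heavily on mention-level scores.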

Highlights

  • The effects of chemicals on living systems of every scale make them an exceptionally important class of entities for biomedical research and clinical applications

  • One of our submissions achieved the highest f-measure reported for the CEM subtask, and our high-recall variant achieved the highest recall reported in both the CDI and CEM subtasks

  • Each selected abstract was annotated by humans for all chemical mentions specific enough to be associated with chemical structure information

Summary

Introduction

The effects of chemicals on living systems of every scale make them an exceptionally important class of entities for biomedical research and clinical applications. Although extracting chemical mentions from the biomedical literature has been attempted previously [4], the task has not yet yielded results approaching those of better-studied entity types such as genes/proteins [5,6,7], species [8], and diseases [9]. This is likely due in part both to the great variety of biologically relevant chemical structures and to the somewhat different properties exhibited by chemical mentions. These properties include systematic and semi-systematic methods for describing chemical structure (e.g. formulas and IUPAC names), whose highly compositional nature makes it difficult to precisely determine the entity boundaries, or even the number of entities present.

