Abstract

The significant amount of medicinal chemistry information contained in patents makes them an attractive target for text mining. In this manuscript, we describe systems for named entity recognition (NER) of chemicals and genes/proteins in patents, using the CEMP (for chemicals) and GPRO (for genes/proteins) corpora provided by the CHEMDNER task at BioCreative V. Our chemical NER system is an ensemble of five open systems, including both versions of tmChem, our previous work on chemical NER. Their output is combined using a machine learning classification approach. Our chemical NER system obtained 0.8752 precision and 0.9129 recall, for 0.8937 f-score on the CEMP task. Our gene/protein NER system is an extension of our previous work for gene and protein NER, GNormPlus. This system obtained a performance of 0.8143 precision and 0.8141 recall, for 0.8137 f-score on the GPRO task. Both systems achieved the highest performance in their respective tasks at BioCreative V. We conclude that an ensemble of independently-created open systems is sufficiently diverse to significantly improve performance over any individual system, even when they use a similar approach.Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/.

Highlights

  • While publications such as those found in the biomedical literature contain a significant amount of useful chemical information [1], much of the useful information on medicinal chemistry is found in less formal documents, such as patents

  • NCBI participated in both the CEMP [chemical named entity recognition (NER)] and gene product and related object (GPRO) subtasks

  • We addressed the CEMP subtask using an ensemble system that combines the results from five individual systems, trained with different data to create a total of ten models

Read more

Summary

Introduction

While publications such as those found in the biomedical literature contain a significant amount of useful chemical information [1], much of the useful information on medicinal chemistry is found in less formal documents, such as patents. We address both the CEMP and GPRO tasks with an ensemble approach, combining the results of several models to improve performance. At the gene/protein NER task at the first BioCreative challenge, one participant combined a support vector machine and two hidden markov models using majority vote [15]. We addressed the CEMP subtask using an ensemble system that combines the results from five individual systems, trained with different data to create a total of ten models. We evaluated our ensemble and the individual models created for the CEMP task in terms of precision, recall and f-score, requiring the predicted span to match the span annotated to consider it a true positive.

Evaluation sets
Methods
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.