Mining chemical patents with an ensemble of open systems.

Robert Leaman, Zhiyong Lu, Cherry Zou, Chih-Hsuan Wei

doi:10.1093/database/baw065

Open Access

https://doi.org/10.1093/database/baw065

Journal: Database
Publication Date: Jan 1, 2016
Citations: 15
License type: cc-by

Affiliation: National Center for Biotechnology Information

Abstract

The significant amount of medicinal chemistry information contained in patents makes them an attractive target for text mining. In this manuscript, we describe systems for named entity recognition (NER) of chemicals and genes/proteins in patents, using the CEMP (for chemicals) and GPRO (for genes/proteins) corpora provided by the CHEMDNER task at BioCreative V. Our chemical NER system is an ensemble of five open systems, including both versions of tmChem, our previous work on chemical NER. Their output is combined using a machine learning classification approach. Our chemical NER system obtained 0.8752 precision and 0.9129 recall, for 0.8937 f-score on the CEMP task. Our gene/protein NER system is an extension of our previous work for gene and protein NER, GNormPlus. This system obtained a performance of 0.8143 precision and 0.8141 recall, for 0.8137 f-score on the GPRO task. Both systems achieved the highest performance in their respective tasks at BioCreative V. We conclude that an ensemble of independently-created open systems is sufficiently diverse to significantly improve performance over any individual system, even when they use a similar approach.

Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/

Highlights

While publications such as those found in the biomedical literature contain a significant amount of useful chemical information [1], much of the useful information on medicinal chemistry is found in less formal documents, such as patents
NCBI participated in both the CEMP (chemical named entity recognition, NER) and GPRO (gene product and related object) subtasks
We addressed the CEMP subtask using an ensemble system that combines the results from five individual systems, trained with different data to create a total of ten models

Summary

Introduction

While publications such as those found in the biomedical literature contain a significant amount of useful chemical information [1], much of the useful information on medicinal chemistry is found in less formal documents, such as patents. We address both the CEMP and GPRO tasks with an ensemble approach, combining the results of several models to improve performance. At the gene/protein NER task at the first BioCreative challenge, one participant combined a support vector machine and two hidden Markov models by majority vote [15]. We addressed the CEMP subtask using an ensemble system that combines the results from five individual systems, trained with different data to create a total of ten models. We evaluated our ensemble and the individual models created for the CEMP task in terms of precision, recall and f-score, counting a prediction as a true positive only if its span matches the annotated span.
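As an illustration of the two ideas in this paragraph, the sketch below combines span predictions from several systems by majority vote and then scores the result with span-matching precision, recall and f-score. It is a minimal sketch, not the authors' method: the abstract states that their ensemble combines outputs with a machine learning classifier, whereas majority vote is the simpler scheme attributed to the earlier participant [15]. The span representation, function names and toy data are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical representation: a mention is a (start, end) character offset pair.
Span = tuple[int, int]


def majority_vote(predictions: list[set[Span]]) -> set[Span]:
    """Keep a span only if more than half of the systems predicted it."""
    counts = Counter(span for system in predictions for span in system)
    threshold = len(predictions) / 2
    return {span for span, n in counts.items() if n > threshold}


def evaluate(predicted: set[Span], gold: set[Span]) -> tuple[float, float, float]:
    """Return (precision, recall, f-score).

    A predicted span is a true positive only if the same span is annotated.
    """
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data (invented for illustration): three annotated mentions, three systems.
gold = {(0, 7), (12, 20), (25, 31)}
systems = [
    {(0, 7), (12, 20)},
    {(0, 7), (12, 19)},            # boundary error on the second mention
    {(0, 7), (12, 20), (40, 45)},  # one spurious mention
]

combined = majority_vote(systems)
precision, recall, f_score = evaluate(combined, gold)
print(sorted(combined), precision, recall, f_score)
```

In this toy case the vote discards the boundary error and the spurious mention, since each is proposed by only one system, but it cannot recover the third annotated mention that no system found.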
