Abstract

The large number of chemical and pharmaceutical patents has attracted biomedical text mining researchers seeking to extract valuable information such as chemicals, genes and gene products. To facilitate gene and gene product annotations in patents, BioCreative V.5 organized a gene- and protein-related object (GPRO) recognition task, in which participants were asked to identify GPRO mentions and determine whether they could be linked to their unique biological database records. In this paper, we describe the system constructed for this task. Our system combines two different NER approaches, the statistical-principle-based approach (SPBA) and conditional random fields (CRF), so we call it SPBA-CRF. SPBA is an interpretable machine-learning framework for gene mention recognition. The predictions of SPBA are used as features for our CRF-based GPRO recognizer. The recognizer was originally developed for identifying chemical mentions in patents, and we adapted it for GPRO recognition. In the BioCreative V.5 GPRO recognition task, SPBA-CRF obtained an F-score of 73.73% on the GPRO type 1 evaluation metric and an F-score of 78.66% on the evaluation metric combining GPRO types 1 and 2. Our results show that SPBA trained on an external NER dataset can perform reasonably well on the partial match evaluation metric. Furthermore, SPBA can significantly improve the performance of the CRF-based recognizer trained on the GPRO dataset.
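The idea of feeding SPBA predictions into the CRF as features can be illustrated with a minimal sketch. The feature names, the BIO-style SPBA tags and the helper functions below are hypothetical and chosen for illustration only; the actual feature set of SPBA-CRF is not given in this summary. The output is a list of per-token feature dictionaries, a shape that common CRF toolkits can typically consume.

```python
from typing import Dict, List


def token_features(tokens: List[str], spba_tags: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for token i, including the SPBA prediction."""
    word = tokens[i]
    feats: Dict[str, object] = {
        "lower": word.lower(),
        "is_upper": word.isupper(),
        "has_digit": any(c.isdigit() for c in word),
        # Assumption: SPBA output is available as one BIO tag per token.
        "spba": spba_tags[i],
    }
    # Neighbouring SPBA tags give the CRF context about mention boundaries.
    if i > 0:
        feats["prev_spba"] = spba_tags[i - 1]
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_spba"] = spba_tags[i + 1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tokens: List[str], spba_tags: List[str]) -> List[Dict[str, object]]:
    """Feature dicts for every token of one sentence."""
    if len(tokens) != len(spba_tags):
        raise ValueError("tokens and spba_tags must have the same length")
    return [token_features(tokens, spba_tags, i) for i in range(len(tokens))]
```

For example, `sentence_features(["A2A", "receptors", "bind"], ["B-GENE", "I-GENE", "O"])` returns three dictionaries, the second of which carries both its own SPBA tag and those of its neighbours.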

Highlights

  • The large number of chemical and pharmaceutical patents has prompted active research in biological text mining

  • We use the evaluation script of BeClam [22] and find that it combines GPRO Type 1 and Type 2, rather than using only GPRO Type 1 as in the BioCreative V GPRO task [4]

  • In this paper, we describe the construction of SPBA-CRF, a system combining a statistical-principle-based approach (SPBA) with conditional random fields (CRF), that can automatically recognize GPRO mentions in chemical patents
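The difference between scoring only GPRO Type 1 and scoring Types 1 and 2 together can be sketched as follows. This is a minimal illustration assuming exact span matching and hypothetical mention tuples of the form (document id, start, end, type); it is not the BeClam or BioCreative evaluation script.

```python
from typing import Iterable, Set, Tuple

# Hypothetical mention format: (doc_id, start_offset, end_offset, gpro_type)
Mention = Tuple[str, int, int, str]
Span = Tuple[str, int, int]


def spans_of(mentions: Iterable[Mention], types: Set[str]) -> Set[Span]:
    """Keep only mentions of the requested types and drop the type label."""
    return {(doc, start, end) for doc, start, end, t in mentions if t in types}


def precision_recall_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F-score over two span sets."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def evaluate(gold: Iterable[Mention], pred: Iterable[Mention],
             types: Set[str]) -> Tuple[float, float, float]:
    """Score predictions against gold, restricted to the given GPRO types."""
    return precision_recall_f1(spans_of(gold, types), spans_of(pred, types))
```

Calling `evaluate(gold, pred, {"GPRO_TYPE_1"})` and `evaluate(gold, pred, {"GPRO_TYPE_1", "GPRO_TYPE_2"})` on the same data generally yields different scores, which is why results reported under the two settings are not directly comparable.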


Summary

Introduction

The large number of chemical and pharmaceutical patents has prompted active research in biological text mining. Named entity recognition (NER) is a fundamental task in biomedical text mining that involves extracting words or phrases that refer to specific entities, such as genes, diseases and chemicals. Given a patent abstract, a text mining system should identify the boundaries of GPRO mentions. The GPRO task is more challenging than other gene mention recognition tasks, such as JNLPBA [2] and BioCreative II GM [3], in the following two aspects. This is because the spans of GPRO mentions are highly related ... “... A2A receptors [GPRO_TYPE_1] ...”
