Abstract

BackgroundWhen term ambiguity and variability are very high, dictionary-based Named Entity Recognition (NER) is not an ideal solution even though large-scale terminological resources are available. Many researches on statistical NER have tried to cope with these problems. However, it is not straightforward how to exploit existing and additional Named Entity (NE) dictionaries in statistical NER. Presumably, addition of NEs to an NE dictionary leads to better performance. However, in reality, the retraining of NER models is required to achieve this. We chose protein name recognition as a case study because it most suffers the problems related to heavy term variation and ambiguity.MethodsWe have established a novel way to improve the NER performance by adding NEs to an NE dictionary without retraining. In our approach, first, known NEs are identified in parallel with Part-of-Speech (POS) tagging based on a general word dictionary and an NE dictionary. Then, statistical NER is trained on the POS/PROTEIN tagger outputs with correct NE labels attached.ResultsWe evaluated performance of our NER on the standard JNLPBA-2004 data set. The F-score on the test set has been improved from 73.14 to 73.78 after adding protein names appearing in the training data to the POS tagger dictionary without any model retraining. The performance further increased to 78.72 after enriching the tagging dictionary with test set protein names.ConclusionOur approach has demonstrated high performance in protein name recognition, which indicates how to make the most of known NEs in statistical NER.

Highlights

  • When term ambiguity and variability are very high, dictionary-based Named Entity Recognition (NER) is not an ideal solution even though large-scale terminological resources are available

  • Results are expressed according to recall (R), precision (P), and F-measure (F), which here measure how accurately our various experiments determined the left boundary (Left), the right boundary (Right), and both boundaries (Full) of protein names

  • Adding protein names appearing in the training data to the dictionary further improved the F-score by 0.64 to 73.78, which is a state-of-the-art performance in protein name recognition using the JNLPBA-2004 data set

Read more

Summary

Introduction

When term ambiguity and variability are very high, dictionary-based Named Entity Recognition (NER) is not an ideal solution even though large-scale terminological resources are available. It is not straightforward how to exploit existing and additional Named Entity (NE) dictionaries in statistical NER. One way of populating these knowledge bases is through named entity recognition (NER). Biomedical NER faces many problems, e.g., protein names are extremely difficult to recognize due to high ambiguity and variability. Protein IL-2 and mathematical expression IL – 2 fall into the same token sequence IL – 2 as usually dash (or hyphen) is designated as a token delimiter In this sense, protein name recognition from tokenized sequence is more challenging than that from text

Methods
Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.