Abstract

Summary: Named entity recognition (NER) is an important step in biomedical information extraction pipelines. Tools for NER should be easy to use, cover multiple entity types, be highly accurate and be robust to variations in text genre and style. We present HunFlair, a NER tagger fulfilling these requirements. HunFlair is integrated into the widely used NLP framework Flair, recognizes five biomedical entity types, matches or exceeds state-of-the-art performance on a wide set of evaluation corpora, and is trained in a cross-corpus setting to avoid corpus-specific bias. Technically, it uses a character-level language model pretrained on roughly 24 million biomedical abstracts and three million full texts. It outperforms other off-the-shelf biomedical NER tools with an average gain of 7.26 pp over the next best tool in a cross-corpus setting and achieves results on par with state-of-the-art research prototypes in in-corpus experiments. HunFlair can be installed with a single command and applied with only four lines of code. Furthermore, it is accompanied by harmonized versions of 23 biomedical NER corpora.

Availability and implementation: HunFlair is freely available through the Flair NLP framework (https://github.com/flairNLP/flair) under an MIT license and is compatible with all major operating systems.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

  • Recognizing biomedical entities (NER) such as genes, chemicals or diseases in unstructured scientific text is a crucial step of all biomedical information extraction pipelines

  • HUNER does not build upon a pretrained language model (LM), although such models were the basis for many recent breakthroughs in NLP research (Akbik et al., 2019)

  • We compare the tagging accuracy of HunFlair to two types of competitors: other ‘off-the-shelf’ biomedical NER tools and other recent research prototypes
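
To make the "average gain in percentage points (pp) over the next best tool" reported in the abstract concrete, the following minimal sketch shows one plausible way such a figure could be computed. It assumes the gain is the difference between the mean score of one tool and the mean score of its best competitor across all corpora; the source does not spell out the exact procedure. The tool names, corpus names and scores are hypothetical and are not results from the paper.

```python
from statistics import mean
from typing import Dict

# Hypothetical scores (in percent) per tool and corpus.
# These numbers are illustrative only and are NOT taken from the paper.
hypothetical_scores: Dict[str, Dict[str, float]] = {
    "tool_a": {"corpus_1": 80.0, "corpus_2": 70.0, "corpus_3": 90.0},
    "tool_b": {"corpus_1": 75.0, "corpus_2": 65.0, "corpus_3": 70.0},
    "tool_c": {"corpus_1": 60.0, "corpus_2": 72.0, "corpus_3": 75.0},
}


def average_gain_over_next_best(
    scores: Dict[str, Dict[str, float]], tool: str
) -> float:
    """Return the gap, in percentage points, between the mean score of
    `tool` and the highest mean score among all other tools."""
    # Average every tool's score over the corpora it was evaluated on.
    averages = {name: mean(per_corpus.values()) for name, per_corpus in scores.items()}
    # The "next best" tool is the competitor with the highest average.
    next_best = max(avg for name, avg in averages.items() if name != tool)
    return averages[tool] - next_best


gain = average_gain_over_next_best(hypothetical_scores, "tool_a")
print(f"{gain:.2f} pp")  # prints "10.00 pp"
```

An alternative reading would take the best competitor separately on each corpus and average those per-corpus gaps; the two definitions generally give different numbers, so a comparison should state which one it uses.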


Summary

Introduction

Named entity recognition (NER), i.e. recognizing biomedical entities such as genes, chemicals or diseases in unstructured scientific text, is a crucial step of all biomedical information extraction pipelines. In any real application, NER tools are applied ‘in the wild’, i.e. to a large collection of texts often varying in focus, entity distribution, genre (e.g. patents versus scientific articles) and text type (e.g. abstract versus full text). This mismatch between evaluation and application settings can lead to severely misleading evaluation results.

HunFlair builds upon a pretrained character-level language model. It recognizes five important biomedical entity types with high accuracy, namely Cell Lines, Chemicals, Diseases, Genes and Species. We integrate 23 biomedical NER corpora into HunFlair using a consistent format, which enables researchers and practitioners to rapidly train their own models and experiment with new approaches within Flair. Note that these are the same corpora that were already made available through HUNER. While HUNER’s corpora came preprocessed with a particular method, users of HunFlair may process the corpora according to their own choices, for instance by using different sentence or word segmentation methods.
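
To illustrate why the choice of word segmentation matters when processing a corpus, here is a minimal, self-contained sketch that projects character-offset entity annotations onto tokens as BIO tags using two interchangeable tokenizers. It does not use the Flair API or the actual HunFlair corpus format; the function names (`whitespace_tokenize`, `punct_tokenize`, `to_bio`, `find_span`), the example sentence and the BIO scheme are assumptions made for illustration.

```python
import re
from typing import Callable, List, Tuple

Token = Tuple[str, int, int]  # (text, start offset, end offset)
Span = Tuple[int, int, str]   # (start offset, end offset, entity type)


def whitespace_tokenize(text: str) -> List[Token]:
    """Split on whitespace only; punctuation stays attached to words."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def punct_tokenize(text: str) -> List[Token]:
    """Split into word-character runs and single punctuation marks."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def to_bio(
    text: str, spans: List[Span], tokenize: Callable[[str], List[Token]]
) -> List[Tuple[str, str]]:
    """Tag each token with B-/I-<type> if it overlaps an entity span, else O."""
    tagged = []
    for tok, start, end in tokenize(text):
        tag = "O"
        for s, e, label in spans:
            if start < e and end > s:  # token overlaps the annotated span
                # The token covering the span start opens the entity.
                tag = ("B-" if start <= s else "I-") + label
                break
        tagged.append((tok, tag))
    return tagged


def find_span(text: str, mention: str, label: str) -> Span:
    """Hypothetical helper: locate the first occurrence of a mention."""
    start = text.index(mention)
    return (start, start + len(mention), label)


text = "Mutations in BRCA1 cause breast cancer."
spans = [find_span(text, "BRCA1", "Gene"), find_span(text, "breast cancer", "Disease")]

for tokenizer in (whitespace_tokenize, punct_tokenize):
    print(tokenizer.__name__, to_bio(text, spans, tokenizer))
```

With whitespace splitting, the final token is `cancer.` and the trailing period ends up inside the Disease entity; with the punctuation-aware tokenizer, the period becomes its own token tagged `O`. Differences of this kind are one reason a corpus distributed with fixed preprocessing can behave differently from the same corpus segmented by a user's own tools.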

HunFlair
Results
Comparison to off-the-shelf tools
Comparison to research prototypes
Conclusion
