Abstract
Background: In biomedical text mining, named entity recognition (NER) is an important task used to extract information from biomedical articles. Previously proposed methods for NER are dictionary- or rule-based methods and machine learning approaches. However, these traditional approaches rely heavily on large-scale dictionaries, target-specific rules, or well-constructed corpora. They have been superseded by deep learning-based approaches that are independent of hand-crafted features. However, although such NER methods employ additional conditional random fields (CRF) to capture important correlations between neighboring labels, they often do not incorporate all the contextual information from the text into the deep learning layers.
Results: We propose an NER system for biomedical entities that incorporates n-grams with bi-directional long short-term memory (BiLSTM) and CRF; this system is referred to as contextual long short-term memory networks with CRF (CLSTM). We assess the CLSTM model on three corpora: the disease corpus of the National Center for Biotechnology Information (NCBI), the BioCreative II Gene Mention corpus (GM), and the BioCreative V Chemical Disease Relation corpus (CDR). Our framework was compared with several deep learning approaches, such as BiLSTM, BiLSTM with CRF, GRAM-CNN, and BERT. On the NCBI corpus, our model recorded an F-score of 85.68% for the NER of diseases, showing an improvement of 1.50% over previous methods. Moreover, although BERT used transfer learning by incorporating more than 2.5 billion words, our system showed performance similar to BERT with an F-score of 81.44% for gene NER on the GM corpus and a higher F-score of 86.44% for the NER of chemicals and diseases on the CDR corpus. We conclude that our method significantly improves performance on biomedical NER tasks.
Conclusion: The proposed approach is robust in recognizing biological entities in text.
Highlights
In biomedical text mining, named entity recognition (NER) is an important task used to extract information from biomedical articles
Although our model performed slightly worse than GRAM-CNN on the National Center for Biotechnology Information (NCBI) corpus, it achieved the best F-scores on the Gene Mention (GM) and Chemical Disease Relation (CDR) corpora
Where the disease mentions in the NCBI test set were “non-inherited breast carcinomas”, “sporadic T-cell leukaemia”, and “dominantly inherited neurodegeneration”, our model predicted “breast carcinomas”, “T-cell leukaemia”, and “neurodegeneration”, respectively
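The boundary errors in the last highlight matter because of how NER is usually scored. The following is a minimal sketch, assuming strict (exact-match) scoring and treating mentions as plain strings rather than character spans; the source does not state which matching criterion was used, and the function name `strict_prf` is hypothetical.

```python
def strict_prf(gold, predicted):
    """Exact-match precision, recall, and F-score over entity mentions.

    Assumption: a prediction counts only if it equals a gold mention exactly.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Mentions quoted in the highlight above.
gold = [
    "non-inherited breast carcinomas",
    "sporadic T-cell leukaemia",
    "dominantly inherited neurodegeneration",
]
pred = ["breast carcinomas", "T-cell leukaemia", "neurodegeneration"]

strict = strict_prf(gold, pred)

# Each prediction is contained in its gold mention, so only the
# left boundary (the modifier) was missed.
partial_hits = sum(p in g for g, p in zip(gold, pred))

print(strict, partial_hits)
```

Under this assumption all three predictions score as errors even though each one captures the head of the disease name, which is why modifier handling can move the F-score noticeably.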
Summary
In biomedical text mining, named entity recognition (NER) is an important task used to extract information from biomedical articles. Previously proposed methods for NER are dictionary- or rule-based methods and machine learning approaches. These traditional approaches rely heavily on large-scale dictionaries, target-specific rules, or well-constructed corpora. They have been superseded by deep learning-based approaches that are independent of hand-crafted features. Although such NER methods employ additional conditional random fields (CRF) to capture important correlations between neighboring labels, they often do not incorporate all the contextual information from the text into the deep learning layers.
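To illustrate what "correlations between neighboring labels" means for a CRF layer, here is a minimal sketch of Viterbi decoding over a toy BIO label set. It is not the authors' implementation: the scores are invented for illustration, and in a BiLSTM-CRF the per-token scores would come from the BiLSTM while the transition scores would be learned.

```python
LABELS = ["O", "B", "I"]


def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence.

    emissions:   list of {label: score}, one dict per token
    transitions: {(prev_label, cur_label): score}
    """
    best = [dict(emissions[0])]  # best[t][label] = best path score ending in label
    back = []                    # back[t][label] = previous label on that path
    for scores in emissions[1:]:
        cur_best, cur_back = {}, {}
        for cur in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + transitions[(p, cur)])
            cur_best[cur] = best[-1][prev] + transitions[(prev, cur)] + scores[cur]
            cur_back[cur] = prev
        best.append(cur_best)
        back.append(cur_back)
    last = max(LABELS, key=lambda label: best[-1][label])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Made-up scores for the two tokens "T-cell" and "leukaemia".
emissions = [
    {"O": 1.0, "B": 0.8, "I": 0.1},
    {"O": 0.2, "B": 0.1, "I": 1.0},
]
# All transitions are neutral except O -> I, which is invalid in BIO tagging.
transitions = {(p, c): 0.0 for p in LABELS for c in LABELS}
transitions[("O", "I")] = -10.0

greedy = [max(LABELS, key=e.get) for e in emissions]  # per-token argmax
decoded = viterbi(emissions, transitions)
print(greedy, decoded)
```

Choosing each label independently yields the invalid sequence O, I, whereas decoding with the transition penalty yields B, I. The transition scores act only on adjacent labels, which is consistent with the summary's point that a CRF on top does not by itself bring wider textual context into the network layers.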