Abstract

In recent years, the growing volume of biomedical documents, coupled with advances in natural language processing algorithms, has led to exponential growth in research on biomedical named entity recognition (BioNER). However, BioNER remains challenging because NER in the biomedical domain: (i) is often restricted by the limited amount of training data; (ii) must handle entities that can refer to multiple types and concepts depending on their context; and (iii) relies heavily on acronyms that are sub-domain specific. Existing BioNER approaches often neglect these issues and directly adopt state-of-the-art (SOTA) models trained on general corpora, which often yields unsatisfactory results. We propose biomedical ALBERT (A Lite Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), bioALBERT, an effective domain-specific pre-trained language model trained on a large biomedical corpus and designed to capture context-dependent representations for biomedical NER. We adopted the self-supervised loss function used in ALBERT that targets modelling inter-sentence coherence (sentence-order prediction) to better learn context-dependent representations, and incorporated parameter-reduction strategies to minimise memory usage and enhance training time in BioNER. In our experiments, BioALBERT outperformed comparative SOTA BioNER models on 8 biomedical NER benchmark datasets covering 4 different entity types. Performance improved for: (i) disease-type corpora by 7.47% (NCBI-disease) and 10.63% (BC5CDR-disease); (ii) drug/chemical-type corpora by 4.61% (BC5CDR-Chem) and 3.89% (BC4CHEMD); (iii) gene/protein-type corpora by 12.25% (BC2GM) and 6.42% (JNLPBA); and (iv) species-type corpora by 6.19% (LINNAEUS) and 23.71% (Species-800), yielding state-of-the-art results. The performance of the proposed model across four different biomedical entity types shows that it is robust and generalisable in recognising biomedical entities in text.
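To make the modelling setup concrete, the following is a minimal sketch of fine-tuning an ALBERT encoder for token-level (BIO-tagged) NER, assuming the HuggingFace transformers library. The checkpoint name "albert-base-v2" is a stand-in, since this page does not give the released BioALBERT checkpoint identifier; the label set and example sentence are illustrative only, not the paper's actual training setup.

    # Minimal sketch, assuming the HuggingFace `transformers` library:
    # fine-tuning an ALBERT encoder for token-level (BIO-tagged) NER.
    # "albert-base-v2" is a stand-in checkpoint; the actual BioALBERT
    # weights and label set are assumptions for illustration.
    import torch
    from transformers import AlbertTokenizerFast, AlbertForTokenClassification

    labels = ["O", "B-Disease", "I-Disease"]  # illustrative BIO tag set

    tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v2")
    model = AlbertForTokenClassification.from_pretrained(
        "albert-base-v2", num_labels=len(labels)
    )

    # One toy optimisation step; real fine-tuning would loop over an
    # annotated corpus such as NCBI-disease with aligned sub-word labels.
    batch = tokenizer("Mutations are linked to breast cancer .",
                      return_tensors="pt")
    gold = torch.zeros_like(batch["input_ids"])  # all-"O" dummy labels
    loss = model(**batch, labels=gold).loss      # token-level cross-entropy
    loss.backward()

ALBERT's cross-layer parameter sharing and factorised embedding parameterisation are what keep such a model small relative to an equivalently configured BERT; this is the parameter-reduction property the abstract refers to.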

Highlights

  • In recent years, the growing volume of biomedical documents, coupled with advances in natural language processing algorithms, has led to exponential growth in research on biomedical named entity recognition (BioNER)

  • To overcome the identified limitations, we present biomedical A Lite Bidirectional Encoder Representations from Transformers (ALBERT), bioALBERT, a context-dependent, fast and effective language model that addresses the shortcomings of recently proposed domain-specific language models

  • We found that BioALBERT outperforms biomedical BERT (BioBERT) by a considerable margin, which makes it faster and more practical than BioBERT models

Summary

Results

We present the datasets used, the baselines and the evaluation to demonstrate the effectiveness of our model. An attention-based BiLSTM-CRF was proposed by Luo et al. [27] for chemical named entity recognition (CNER). This approach leverages global document-level information gathered through the attention mechanism to ensure labelling consistency across multiple instances of the same token in a document, and it achieved good results with some feature engineering. The authors of the BioBERT model demonstrated that training BERT on a biomedical corpus improves performance on BioNER and outperforms previously presented BioNER models. BioALBERT gives better performance and addresses the previously mentioned challenges in the biomedical domain. We attribute this to BioALBERT being built on top of a transformer-based language model that learns contextual relationships between words in the corpus. This demonstrates the relevance of duplicated data in NLP tasks.
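As an illustration of how such a token classifier exploits context at inference time, the sketch below continues the hedged example given after the abstract (reusing the same assumed tokenizer, model and label list): per-token logits are arg-maxed into BIO tags, so because the logits come from contextual representations, the same surface form can receive different tags in different sentences.

    # Continuation of the earlier illustrative sketch: contextual tagging.
    model.eval()
    enc = tokenizer("The patient was diagnosed with breast cancer .",
                    return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits            # (1, seq_len, num_labels)
    pred = logits.argmax(dim=-1)[0]
    for tok, lab_id in zip(
            tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), pred):
        print(f"{tok}\t{labels[int(lab_id)]}")  # one BIO tag per sub-word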
