D3NER: biomedical named entity recognition using CRF-biLSTM improved with fine-tuned embeddings of various linguistic information.

Hoang-Quynh Le,Thanh Hai Dang,Jonathan Wren,Sinh T Vu,Trang M Nguyen

doi:10.1093/bioinformatics/bty356

Hoang-Quynh Le, Thanh Hai Dang + Show 3 more

Open Access

https://doi.org/10.1093/bioinformatics/bty356

Copy DOI

Abstract

Recognition of biomedical named entities in the textual literature is a highly challenging research topic with great interest, playing as the prerequisite for extracting huge amount of high-valued biomedical knowledge deposited in unstructured text and transforming them into well-structured formats. Long Short-Term Memory (LSTM) networks have recently been employed in various biomedical named entity recognition (NER) models with great success. They, however, often did not take advantages of all useful linguistic information and still have many aspects to be further improved for better performance. We propose D3NER, a novel biomedical named entity recognition (NER) model using conditional random fields and bidirectional long short-term memory improved with fine-tuned embeddings of various linguistic information. D3NER is thoroughly compared with seven very recent state-of-the-art NER models, of which two are even joint models with named entity normalization (NEN), which was proven to bring performance improvements to NER. Experimental results on benchmark datasets, i.e. the BioCreative V Chemical Disease Relation (BC5 CDR), the NCBI Disease and the FSU-PRGE gene/protein corpus, demonstrate the out-performance and stability of D3NER over all compared models for chemical, gene/protein NER and over all models (without NEN jointed, as D3NER) for disease NER, in almost all cases. On the BC5 CDR corpus, D3NER achieves F1of93.14and84.68% for the chemical and disease NER, respectively; while on the NCBI Disease corpus, its F1 for the disease NER is 84.41%. Its F1 for the gene/protein NER on FSU-PRGE is 87.62%. Data and source code are available at: https://github.com/aidantee/D3NER. Supplementary data are available at Bioinformatics online.

Full Text