Abstract

Named Entity Recognition (NER) is an essential prerequisite for effective text mining of biomedical text. With the recent growth in the amount of biomedical literature, exploiting unlabeled text to improve system performance has become an active and challenging research topic in text mining. In this study, we take a step towards a unified NER system for the biomedical, chemical and medical domains. We evaluate word representation features learned automatically from a large unlabeled corpus for disease NER. The word representation features include Brown cluster labels and Word Vector Classes (WVC), which are built by applying k-means clustering to the continuous-valued word vectors of a Neural Language Model (NLM). An experimental evaluation on the Arizona Disease Corpus (AZDC) showed that these word representation features improve system performance significantly, as a manually tuned domain dictionary does. BANNER-CHEMDNER, a chemical and biomedical NER system, has been extended with a disease mention recognition model that achieves a 77.84% F-measure on AZDC under 10-fold cross-validation. BANNER-CHEMDNER is freely available at: https://bitbucket.org/tsendeemts/banner-chemdner.
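To make the Word Vector Class idea concrete, the following is a minimal sketch of how k-means cluster labels over word vectors could be turned into discrete token features. It is not the BANNER-CHEMDNER implementation: the two-dimensional vectors are toy values invented for illustration (real NLM vectors have many more dimensions), and the names `kmeans`, `build_word_vector_classes`, `token_features` and the `WVC_UNK` fallback label are all hypothetical.

```python
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means over a list of equal-length float lists.

    Returns one cluster index per input vector.
    """
    rng = random.Random(seed)
    # Initialise centroids from k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid
        # (squared Euclidean distance).
        for i, v in enumerate(vectors):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[i] for i, a in enumerate(assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


def build_word_vector_classes(word_vectors, k, seed=0):
    """Map each word to a discrete class label derived from its cluster."""
    words = sorted(word_vectors)
    labels = kmeans([word_vectors[w] for w in words], k, seed=seed)
    return {w: f"WVC_{label}" for w, label in zip(words, labels)}


def token_features(token, wvc):
    """Features for one token; unseen words get a shared fallback class."""
    key = token.lower()
    return {"word": key, "wvc": wvc.get(key, "WVC_UNK")}


# Toy vectors (assumed values, for illustration only).
word_vectors = {
    "diabetes": [0.90, 0.10],
    "asthma": [0.80, 0.20],
    "carcinoma": [0.85, 0.15],
    "the": [0.10, 0.90],
    "of": [0.15, 0.85],
    "with": [0.20, 0.80],
}

wvc = build_word_vector_classes(word_vectors, k=2)
print(token_features("Asthma", wvc))
```

In a sequence labeler, the `wvc` value would be one categorical feature among others for each token, which lets words that never appear in the labeled training data share evidence with similar words from the unlabeled corpus.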

