Abstract

Active learning (AL) methods for Named Entity Recognition (NER) perform best when train and test data are drawn from the same distribution. However, little research in active learning has considered how to leverage the similarity between the train and test data distributions. This is especially critical for clinical NER, where some concepts (e.g., Symptoms) occur only rarely. In this paper, we therefore present a novel AL method for clinical NER that selects the most beneficial instances for training by comparing the train and test data distributions using computationally inexpensive similarity metrics. When using GloVe embeddings, our method outperforms baseline AL methods by up to 11% in terms of the reduction in training data required to reach the best performance of a target NER model. In addition, our method outperforms the baselines by a wide margin in the first 20 iterations, with an average margin exceeding 10% on both ShARe/CLEF 2013 and i2b2/VA 2010. When using BioBERT embeddings, our method outperforms baseline AL methods by up to 6% in terms of the reduction in training data required to reach the target NER model performance.
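The selection idea described above can be illustrated with a minimal sketch. It is not the method of the paper, whose similarity metrics and selection criterion are not specified in this abstract. The sketch assumes each sentence is already represented by a fixed-size embedding vector, uses cosine similarity to the centroid of the test embeddings as a stand-in for a low-cost similarity metric, and picks the unlabeled sentences most similar to the test data. The names `cosine`, `centroid`, and `select_batch`, as well as the toy vectors, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_batch(pool, test_vectors, batch_size):
    """Return indices of the pool vectors most similar to the test centroid.

    Assumption: similarity to the test centroid stands in for whatever
    train/test distribution comparison a real system would use.
    """
    target = centroid(test_vectors)
    scored = [(cosine(vec, target), idx) for idx, vec in enumerate(pool)]
    # Highest similarity first; ties are broken by the lower pool index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:batch_size]]


# Toy 2-dimensional "embeddings" (hypothetical data, for illustration only).
pool = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
test_vectors = [[0.9, 0.1], [1.0, 0.0]]
selected = select_batch(pool, test_vectors, 2)
print(selected)
```

In an AL loop, the selected sentences would be sent for annotation, added to the training set, and the NER model retrained before the next iteration. A practical system would likely combine such a similarity score with a model-uncertainty signal.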
