Abstract

Distant supervision has proven to be an efficient way of generating labeled instances for Named Entity Recognition (NER). However, it suffers from dictionary biases and ambiguous entities, resulting in noisy and incomplete labels. To overcome this drawback, this paper proposes a template-augmented distant supervision framework, which generates high-quality labeled training data with minimal human effort. Specifically, we use distant supervision to extract sentences that contain entities and apply a pre-trained language model to encode these sentences. The encoded sentences are clustered, and for each cluster, three sentences are sampled to form a seed template pool. The seed templates are calibrated and decomposed to decouple the connections between their different parts. Finally, the seed templates and the entity dictionary are combined with a pre-trained language model to generate semantically coherent and precisely labeled training data. Experimental results on the EC and NEWS datasets and a practical electronic after-sale Q&A dataset, with multiple pre-trained language models, demonstrate that the proposed framework improves the F1 score of distantly supervised NER models by 7.9%–12.9%.
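The clustering-and-sampling step that builds the seed template pool can be sketched as follows. This is a minimal illustration, not the paper's implementation: the sentence strings and random vectors stand in for real distantly supervised sentences and their pre-trained-language-model embeddings, and the lightweight k-means routine and the cluster count `k` are assumptions for the sketch.

```python
import numpy as np


def kmeans(X, k, iters=10, seed=0):
    """Tiny k-means used only for illustration; any clustering method works."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each embedding to its nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return labels


rng = np.random.default_rng(0)
# Stand-ins: in the framework these would be sentences extracted by distant
# supervision and their encoder embeddings (e.g., [CLS] vectors).
sentences = [f"sentence {i}" for i in range(100)]
embeddings = rng.normal(size=(100, 32))

k = 5  # number of clusters; an illustrative choice
labels = kmeans(embeddings, k)

# Sample three sentences per cluster to form the seed template pool.
seed_pool = []
for c in range(k):
    idx = np.flatnonzero(labels == c)
    chosen = rng.choice(idx, size=min(3, idx.size), replace=False)
    seed_pool.extend(sentences[i] for i in chosen)
```

After this step, each sampled sentence would be calibrated and decomposed into template parts before being used for data generation.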
