Abstract

In recent years, machine learning based systems have achieved impressive performance, but only when the training and test data come from the same distribution. This means such systems often perform poorly on newly generated data, because the new data follow a different distribution from the existing training data. The named entity recognition (NER) task in particular faces this problem, because new named entities are continually created over time. However, manually annotating every newly generated text is prohibitively expensive. We therefore propose a method that reduces the cost of manual annotation for recently generated texts by exploiting a large amount of unlabeled data. We first automatically recognize named entities in the unlabeled data using a knowledge base (KB). This automatic annotation of unstructured data is cheap, but it introduces considerable noise because the labels are not gold-standard. To overcome this problem, we then apply a transfer learning approach that reduces the influence of noise in the automatically annotated data: the automatically annotated data are used to pre-train the NER model, which is then fine-tuned on the existing manually annotated data. We evaluate the proposed method on three datasets with different distributions. Experimental results demonstrate that our approach improves NER performance on recent texts.
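
To make the two-stage procedure concrete, the following is a minimal sketch in PyTorch, assuming a toy BiLSTM tagger and randomly generated stand-in batches. The model, the data, and the hyperparameters are illustrative placeholders, not the authors' actual setup.

```python
import torch
import torch.nn as nn


class SimpleTagger(nn.Module):
    """Toy BiLSTM token classifier standing in for the paper's NER model."""

    def __init__(self, vocab_size=10_000, num_tags=9, emb_dim=64, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, token_ids):
        hidden_states, _ = self.lstm(self.emb(token_ids))
        return self.out(hidden_states)  # (batch, seq_len, num_tags)


def toy_batches(n_batches, batch_size=8, seq_len=12, vocab=10_000, tags=9):
    """Random stand-in data; in practice these would come from the
    KB-annotated corpus (stage 1) and the gold-standard corpus (stage 2)."""
    return [(torch.randint(0, vocab, (batch_size, seq_len)),
             torch.randint(0, tags, (batch_size, seq_len)))
            for _ in range(n_batches)]


def train(model, batches, epochs, lr):
    """One generic token-level training loop, reused for both stages."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for tokens, tags in batches:
            opt.zero_grad()
            logits = model(tokens)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), tags.reshape(-1))
            loss.backward()
            opt.step()


model = SimpleTagger()

# Stage 1: pre-train on the large, noisy, KB-annotated data.
train(model, toy_batches(50), epochs=3, lr=1e-3)

# Stage 2: fine-tune on the smaller gold-standard data, with a lower
# learning rate so the clean labels refine rather than overwrite what
# was learned from the noisy pre-training.
train(model, toy_batches(10), epochs=5, lr=1e-4)
```

The key point of the sketch is that both stages share a single training loop; only the data source and the learning rate change between pre-training and fine-tuning.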
