Abstract

With the rapid growth of Internet penetration, more and more people choose the Internet to express their views on topics of interest. In recent years, named entity recognition (NER) is becoming a popular task for the public to obtain structured information from public opinion text. At present, NER models with good results, such as deep learning model, need a lot of labeled data for training. However, this will give rise to a problem: labeling a large amount of data requires a lot of human resources, which is thankless in some areas. Therefore, in this paper, we propose a NER model combining active learning and deep learning methods. Firstly, the active learning method can solve the above problem. The strategy combines uncertainty-based sampling and diversity-based sampling to estimate the information of data. We use highly informative data as the initial training dataset. Secondly, this paper uses a deep learning model combining bidirectional encoder representations from Transformers, bidirectional long–short-term memory and conditional random field (BERT-BiLSTM-CRF). BERT extracts the semantic features of data, and BiLSTM predicts the probability distribution of entity labels. We use the CRF for decoding the probability distribution into corresponding entity labels. Finally, we use the initial training dataset for training BERT-BiLSTM-CRF. This model predicts the entity labels of the unlabeled data. Then, we judge if the machine-labeled data is highly reliable and expand the highly reliable data to the initial training dataset. The updated dataset retrains the NER model, so that the trained model has higher precision than the previous model. The results show that our model performs well without a large number of labeled datasets. The model achieves a precision value of 70.31%, recall rate of 74.93% and F1 score of 72.55% in the named entity recognition task, which proves the effectiveness of our model. Besides, the F1 score of BERT-BiLSTM-CRF with uncertainty-based sampling and diversity-based sampling (UD_BBC) is higher than the BiLSTM-CRF based on maximum normalized log-probability (MNLP_BiLSTM-CRF) by 9.00%, when recognizing overall entity categories. It provides a solution to the problem of named entity recognition in educational public opinion.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call