Abstract

To address the limited ability of existing named entity recognition (NER) models to recognize common unknown (out-of-vocabulary) words in data, this paper proposes a text vectorization method based on a pre-trained language model. A program cannot understand raw text directly; the text must first be converted into numerical form. The paper first reviews word vector representations, covering both discrete and distributed approaches. Traditional word vector methods cannot handle polysemy and fail to fully capture semantic features. To overcome these shortcomings, a text vectorization method based on a pre-trained language model is proposed: following the fine-tuning paradigm, a model pre-trained on massive datasets is transferred to the People's Daily dataset and its parameters are optimized there. Finally, a comparative experiment is designed on the People's Daily dataset against traditional word embedding methods (CBOW, Skip-gram, and GloVe); the results are analyzed and verify the effectiveness of the proposed method.
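The following is a minimal sketch of the fine-tuning idea the abstract describes, not the authors' implementation: a pre-trained language model is adapted to NER on a new dataset by adding a token-classification head and optimizing the parameters on labeled data. The abstract does not name the specific pre-trained model, so a BERT-style checkpoint via the Hugging Face transformers library is assumed here; the checkpoint name, label set, and toy sentence are illustrative assumptions.

```python
# Minimal fine-tuning sketch (assumed BERT-style model, not the paper's exact setup).
import torch
from transformers import BertForTokenClassification, BertTokenizerFast

label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]  # assumed tag set

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(label_list)
)

# One toy gradient step; a real run would iterate over the People's Daily corpus.
chars = list("人民日报社位于北京")      # character-level tokens
gold = [3, 4, 4, 4, 4, 0, 0, 5, 6]     # hypothetical BIO labels for the toy sentence

enc = tokenizer(chars, is_split_into_words=True, return_tensors="pt")
# Align labels with subword pieces; -100 makes the loss ignore [CLS]/[SEP].
aligned = [gold[w] if w is not None else -100 for w in enc.word_ids()]
labels = torch.tensor([aligned])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**enc, labels=labels).loss  # contextual vectors -> tag logits -> loss
loss.backward()
optimizer.step()
```

Unlike the static CBOW, Skip-gram, and GloVe baselines, which assign each word a single fixed vector, the pre-trained model produces context-dependent vectors, so the same character can receive different representations in different sentences; this is the mechanism by which the polysemy problem noted in the abstract is addressed.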
