Abstract

The volume of scientific literature grows year after year alongside the rapid expansion of the Internet, and effective scientific literature retrieval supports users' research work. Scientific literature contains unregistered technical terms, which introduce semantic bias into the text representations built by traditional retrieval methods and lead to low retrieval precision, recall, and accuracy. To address these problems, a text retrieval method that integrates scientific knowledge with a pre-trained model is proposed. The method first mines new words from scientific publications using a new-word mining algorithm based on mutual information and adjacency entropy, then uses these words as whole-word masking targets for secondary pre-training of the RoBERTa-wwm model, so that the pre-trained model integrates scientific knowledge and better understands scientific terminology. The secondary pre-trained model then encodes the original text to obtain matching semantic vectors; one-dimensional convolution performs feature extraction, and a max-pooling layer selects the key text features. Finally, a fully connected layer maps the text features to matching scores between texts. Experimental results show that this method outperforms numerous other models on text retrieval tasks in the scientific domain.
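The new-word mining step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: candidate strings are kept when their internal cohesion (minimum pointwise mutual information over all binary splits) and their contextual freedom (minimum of left and right adjacency entropy) both exceed thresholds. All threshold values and the `max_len`/`min_count` parameters here are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def mine_new_words(corpus, max_len=4, min_count=2, min_pmi=1.0, min_entropy=0.5):
    """Sketch of mutual-information + adjacency-entropy new-word mining.

    Thresholds and parameters are illustrative assumptions, not the
    values used in the paper.
    """
    n = len(corpus)
    counts = Counter()               # frequency of every candidate substring
    left = defaultdict(Counter)      # distribution of left-neighbour characters
    right = defaultdict(Counter)     # distribution of right-neighbour characters
    for i in range(n):
        for length in range(1, max_len + 1):
            if i + length > n:
                break
            w = corpus[i:i + length]
            counts[w] += 1
            if i > 0:
                left[w][corpus[i - 1]] += 1
            if i + length < n:
                right[w][corpus[i + length]] += 1

    def entropy(c):
        total = sum(c.values())
        return -sum(v / total * math.log(v / total) for v in c.values()) if total else 0.0

    found = []
    for w, c in counts.items():
        if len(w) < 2 or c < min_count:
            continue
        # Cohesion: minimum pointwise mutual information over all binary splits.
        pmi = min(
            math.log((c / n) / ((counts[w[:k]] / n) * (counts[w[k:]] / n)))
            for k in range(1, len(w))
        )
        # Freedom: minimum of left/right adjacency (branch) entropy.
        freedom = min(entropy(left[w]), entropy(right[w]))
        if pmi >= min_pmi and freedom >= min_entropy:
            found.append(w)
    return found
```

On a toy corpus where a term such as 深度学习 recurs in varied contexts, the term scores high on both cohesion and adjacency entropy, while fragments like 深度学 are rejected because their right-neighbour entropy is zero. The mined words would then be fed to RoBERTa-wwm's whole-word masking for secondary pre-training.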
