Abstract

As a basic research topic in natural language processing, the calculation of text similarity is widely used in the fields of plagiarism checker and sentence search. The traditional calculation of text similarity constructed text vectors only based on TF-IDF, and used the cosine of the angle between vectors to measure the similarity between two texts. However, this method cannot solve the similar text detection task with different text representation but similar semantic representation. In response to the above-mentioned problems, we proposed the pre-training of text based on the ERNIE semantic model of PaddleHub, and constructed similar text detection into a classification problem; in view of the problem that most of the similar texts in the data set led to the imbalance of categories in the training set, an oversampling method for confusion sampling, OSConfusion, was proposed. The experimental results showed that the method proposed in this paper was able to solve the problem of paper comparison well, and could identify the repetitive paragraphs with different text representations. And the ERNIE-SIM with OSConfusion was better than the ERNIE-SIM without OSConfusion in the prediction process of similar document pairs in terms of precision and recall.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call