An Improved Simhash Algorithm for Academic Paper Checking System

Mengxia Wang, Wenqiang Fan

doi:10.1109/dsa52907.2021.00109

Abstract

In academic paper checking systems, the traditional Simhash algorithm tends to lose local information and yields low accuracy when calculating text similarity. To address these problems, an improved Simhash algorithm is proposed that focuses on the algorithm's weighting method. The weight of each text feature word is calculated with the TF-IDF algorithm, that is, from term frequency and inverse document frequency, and the text similarity is then calculated step by step from the weighted features. Extensive experimental data show that the improved algorithm captures text information more comprehensively and raises accuracy from 66.67% for the traditional algorithm to 88.89%, which indicates that the proposed algorithm is effective.
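
The general idea of a TF-IDF-weighted Simhash can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the smoothed IDF formula, the 64-bit fingerprint length, and the use of MD5 as the per-term hash are all assumptions made for the example, and the abstract does not specify any of them.

```python
import hashlib
import math
from collections import Counter

BITS = 64  # fingerprint length (assumption; not stated in the abstract)


def tf_idf_weights(doc_tokens, corpus):
    """Return {term: tf * idf} for one tokenized document.

    `corpus` is a list of tokenized documents (lists of strings).
    The smoothed IDF used here is an assumption for illustration.
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        df = sum(1 for doc in corpus if term in doc)  # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[term] = (count / total) * idf
    return weights


def simhash(weights):
    """Build a BITS-bit Simhash fingerprint from {term: weight}."""
    acc = [0.0] * BITS
    for term, w in weights.items():
        # First 8 bytes of MD5 give a 64-bit hash per term (hypothetical choice).
        digest = hashlib.md5(term.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            acc[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i, value in enumerate(acc):
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

In this sketch, two documents would be compared by computing `simhash(tf_idf_weights(tokens, corpus))` for each and taking `hamming_distance` of the two fingerprints; a smaller distance suggests higher similarity. The threshold at which two papers count as similar is left open here, since the abstract does not give one.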

Full Text