Research on LCS Word Segmentation Expansion and Optimization of Subsequence Solving Algorithm

Mo Ruoyu, Yang Ting, Zhu Zhousen, Zhang Xiujuan

doi:10.57237/j.cst.2023.01.006

Abstract

Similarity calculation is one of the most common and critical tasks in natural language processing (NLP), with wide application in fields such as plagiarism detection and information retrieval. To improve the accuracy of text similarity calculation, this paper analyzes the traditional longest common subsequence (LCS) algorithm in depth and proposes an extended LCS algorithm based on word segmentation and synonym matching. Drawing on recent results in NLP research, the algorithm addresses two problems: LCS cannot detect common plagiarism and copying techniques when used for text similarity comparison, and the backtracking algorithm used to recover the longest common subsequence has high time complexity and poor performance. After word segmentation, the algorithm computes the similarity between words using a synonym word forest and thereby matches synonyms across sequences, which lets it detect plagiarism and copying techniques such as synonym substitution in the original text. The algorithm also improves the traditional procedure for recovering the LCS: it records the associated positions of the characters of the longest common subsequence in each sequence, accepting a moderate increase in space complexity, so that the common sequences are chain-marked. Experimental results show that the proposed LCS extension accurately identifies synonym substitution in text and yields more accurate text similarity results, while the time complexity of recovering the LCS is reduced from O(2^max(m,n)) to linear, O(n).
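The idea of running LCS over segmented words with synonym-aware matching can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `SYNONYMS` table is a hypothetical stand-in for the synonym word forest, the inputs are assumed to be already segmented into words, the `similarity` formula (twice the LCS length over the total word count) is an assumed normalization, and the subsequence is recovered by a single walk over the standard dynamic-programming table rather than by the chain marking described above.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# Hypothetical toy synonym groups; a real system would query a synonym lexicon.
SYNONYMS: List[FrozenSet[str]] = [
    frozenset({"quick", "fast", "rapid"}),
    frozenset({"big", "large"}),
]


def same_word(a: str, b: str) -> bool:
    """Treat two words as matching if they are equal or share a synonym group."""
    if a == b:
        return True
    return any(a in group and b in group for group in SYNONYMS)


def lcs_pairs(
    x: Sequence[str],
    y: Sequence[str],
    match: Callable[[str, str], bool] = same_word,
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of one longest common subsequence of words."""
    m, n = len(x), len(y)
    # dp[i][j] = LCS length of x[:i] and y[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if match(x[i - 1], y[j - 1]):
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Single walk from the bottom-right corner: at most m + n steps.
    pairs: List[Tuple[int, int]] = []
    i, j = m, n
    while i > 0 and j > 0:
        if match(x[i - 1], y[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def similarity(x: Sequence[str], y: Sequence[str]) -> float:
    """Assumed normalization: 2 * |LCS| / (|x| + |y|)."""
    if not x and not y:
        return 1.0
    return 2 * len(lcs_pairs(x, y)) / (len(x) + len(y))
```

For the word lists `"the quick dog ran".split()` and `"the fast dog walked".split()`, exact matching aligns only "the" and "dog", whereas the synonym-aware predicate also aligns "quick" with "fast", which raises the similarity score.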

Full Text