Abstract

As one of the most common and critical tasks in natural language processing (NLP), text similarity calculation has a wide range of applications in fields such as censorship detection and information retrieval. To improve the accuracy of text similarity calculation, this paper analyzes the traditional longest common subsequence (LCS) algorithm in depth and proposes an extended LCS algorithm based on word segmentation and synonym matching. Drawing on recent results in NLP research, the algorithm addresses two problems: LCS cannot detect common plagiarism and copying techniques when used for text similarity comparison, and the backtracking algorithm used to recover the longest common subsequence has high time complexity and poor performance. After word segmentation, the algorithm computes the similarity between words using a synonym word forest, which allows synonyms in the two sequences to be matched and makes it possible to detect plagiarism and copying techniques such as substituting synonyms into the original text. The algorithm also improves the traditional procedure for recovering the LCS: it records the associated positions of the characters of the longest common subsequence in each sequence and, at the cost of some additional space, links the common subsequence into a marked chain. The experimental results show that the proposed extended LCS algorithm accurately identifies synonym substitution in text and yields more accurate text similarity results, while the time complexity of recovering the LCS is reduced from <i>O</i>(2<sup><i>max</i>(<i>m</i>,<i>n</i>)</sup>) to the linear level <i>O</i>(<i>n</i>).
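The idea described in the abstract can be illustrated with a minimal sketch; it is not the paper's implementation. It assumes the texts are already segmented into word lists, and it stands in for the synonym word forest with a small hypothetical table, `SYNONYM_GROUPS`. The names `is_match`, `lcs_chain` and `similarity`, and the normalization 2·|LCS| / (m + n), are likewise assumptions made for illustration. The sketch fills the standard dynamic-programming table and, for every matched pair, records a link to the previous matched pair, so the common subsequence can be read off by following the chain instead of backtracking through the table.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[int, int]

# Hypothetical toy synonym table; a real system would query a thesaurus instead.
SYNONYM_GROUPS = [{"quick", "fast", "rapid"}, {"big", "large"}]


def is_match(x: str, y: str) -> bool:
    """Two words match if they are identical or share a synonym group."""
    return x == y or any(x in g and y in g for g in SYNONYM_GROUPS)


def lcs_chain(a: List[str], b: List[str]) -> List[Pair]:
    """Return the matched (index_in_a, index_in_b) pairs of one LCS of a and b."""
    m, n = len(a), len(b)
    length = [[0] * (n + 1) for _ in range(m + 1)]
    # last[i][j]: 1-based position pair of the final match in an LCS of a[:i], b[:j].
    last: List[List[Optional[Pair]]] = [[None] * (n + 1) for _ in range(m + 1)]
    # prev[pair]: the matched pair that precedes `pair` in the chain (None at the head).
    prev: Dict[Pair, Optional[Pair]] = {}

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if is_match(a[i - 1], b[j - 1]):
                length[i][j] = length[i - 1][j - 1] + 1
                last[i][j] = (i, j)
                prev[(i, j)] = last[i - 1][j - 1]
            elif length[i - 1][j] >= length[i][j - 1]:
                length[i][j] = length[i - 1][j]
                last[i][j] = last[i - 1][j]
            else:
                length[i][j] = length[i][j - 1]
                last[i][j] = last[i][j - 1]

    # Follow the chain of recorded positions; this visits only matched pairs.
    chain: List[Pair] = []
    node = last[m][n]
    while node is not None:
        chain.append((node[0] - 1, node[1] - 1))  # convert to 0-based indices
        node = prev[node]
    chain.reverse()
    return chain


def similarity(a: List[str], b: List[str]) -> float:
    """Assumed normalization: 2 * |LCS| / (len(a) + len(b))."""
    if not a and not b:
        return 1.0
    return 2 * len(lcs_chain(a, b)) / (len(a) + len(b))


if __name__ == "__main__":
    original = "the quick dog is big".split()
    rewritten = "the fast dog was large".split()
    print(lcs_chain(original, rewritten))
    print(similarity(original, rewritten))
```

In this sketch the table fill still takes time proportional to the product of the two sequence lengths; only the recovery step is linear, since following the chain touches each matched pair once. The extra `last` and `prev` structures are the additional space traded for that faster recovery.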

