Abstract

Short texts have sparse features, which makes calculating their similarity a considerable challenge. However, there has been little research on feature extension methods for Chinese short texts in similarity calculation. To better understand how feature extension can be used in Chinese short text similarity, this paper adopts a feature extension algorithm for short texts based on an external thesaurus, Tongyici Cilin (extended). The aim is to alleviate the sparseness of Chinese short text feature vectors. First, short texts with high surface similarity are segmented into words according to a set of rules, and the main differing components of the texts are extracted. Then, the similarity between the main differing components of the two short texts is calculated based on Cilin. Finally, feature extension is performed on the corresponding short texts according to the similarity results. A variety of unsupervised models are tested on the Large-scale Chinese Question Matching Corpus (LCQMC). The experimental results show that the proposed method improves a range of vector space similarity algorithms, raising accuracy and F1-score by about 3%.
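The three steps described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the thesaurus entries and their codes are made up for illustration, the word similarity is a simplified ratio of shared code levels rather than the paper's formula, the 0.8 threshold is an assumed value, and the input texts are assumed to be already segmented into words.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy thesaurus: word -> Cilin-style code (codes invented for illustration).
TOY_CILIN = {
    "电脑": "Bo01A01=",
    "计算机": "Bo01A01=",
    "手机": "Bo01B02=",
}

# Prefix lengths that close each of the five hierarchy levels in a Cilin-style code.
LEVEL_ENDS = (1, 2, 4, 5, 7)


def word_similarity(w1, w2, thesaurus=TOY_CILIN):
    """Simplified similarity: fraction of hierarchy levels two codes share."""
    if w1 == w2:
        return 1.0
    c1, c2 = thesaurus.get(w1), thesaurus.get(w2)
    if c1 is None or c2 is None:
        return 0.0
    shared = 0
    for end in LEVEL_ENDS:
        if c1[:end] != c2[:end]:
            break
        shared += 1
    return shared / len(LEVEL_ENDS)


def difference_components(tokens_a, tokens_b):
    """Return the tokens that appear in only one of the two texts."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    return ([t for t in tokens_a if t not in set_b],
            [t for t in tokens_b if t not in set_a])


def extend_features(tokens_a, tokens_b, threshold=0.8):
    """Add each text's differing word to the other text when the pair is similar enough."""
    diff_a, diff_b = difference_components(tokens_a, tokens_b)
    ext_a, ext_b = list(tokens_a), list(tokens_b)
    for wa in diff_a:
        for wb in diff_b:
            if word_similarity(wa, wb) >= threshold:
                ext_a.append(wb)
                ext_b.append(wa)
    return ext_a, ext_b


def cosine(tokens_a, tokens_b):
    """Cosine similarity between bag-of-words count vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


# Two pre-segmented questions that differ only in one near-synonym.
text_a = ["我", "的", "电脑", "坏", "了"]
text_b = ["我", "的", "计算机", "坏", "了"]

before = cosine(text_a, text_b)
ext_a, ext_b = extend_features(text_a, text_b)
after = cosine(ext_a, ext_b)
print(before, after)
```

In this toy case the two questions share four of five tokens, so the bag-of-words cosine is limited by the single differing word; after extension both texts contain both near-synonyms and the vectors coincide. A real pipeline would replace the toy dictionary with the actual thesaurus and a proper word segmenter.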
