Abstract

Sentence similarity is the task of assessing how similar two snippets of text are. Similarity techniques are used extensively in clustering, summarization, classification, plagiarism detection, and related tasks. Because sentences offer only a small vocabulary to compare, sentence similarity is considered a difficult problem in natural language processing. There are two issues in solving this problem: (1) which similarity technique to use for word pair similarity and (2) how to generalize that to sentence pairs. We used the weighted path, a WordNet-based similarity assessment, and the paraphrase database to obtain word pair similarity values. We then extracted maximum values from the pairwise similarity matrix and computed a similarity value for each sentence pair. We also incorporated a vector space model technique to form a robust similarity measure. Our method outperformed state-of-the-art methods on the STSS65 test dataset, reaching a Pearson's correlation of 87 % with human similarity scores. Moreover, our approach performed on par with other methods on the STSS131 test data under the same measure. Our approach outperforms all the other WordNet-based methods compared on both datasets.

Highlights

  • Similar sentences may discuss the same idea, or they may be on a similar topic

  • Results on the STSS131 dataset demonstrate that our work, Latent semantic analysis (LSA), and Semantic text similarity (STS) are on par with human similarity scores, which means that these three approaches have the least average difference from human scores

  • This paper has argued that recent approaches have not been thorough in feature extraction from similarity matrices and that the importance of information content value was neglected in most studies


Summary

Introduction

Similar sentences may discuss the same idea, or they may be on a similar topic. Similar sentence pairs usually contain common words, link to common concepts, and have many co-occurring words. Latent semantic analysis (LSA) is a popular approach that is used extensively for NLP tasks [7]. This method is based on statistics and uses the frequency values of words in both sentences to compute similarity. Semantic text similarity (STS) [9] proposes to combine three metrics to compute similarity: (1) string matching, in which the number of common characters between word pairs is computed, (2) the SOCPMI approach, and (3) word order information. Two methods, namely sentence vector similarity and similarity matrix values, were combined to form a robust measure. This led to a better correlation than either method achieved individually.
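The combination of similarity matrix values and sentence vector similarity can be illustrated with a minimal sketch. It is not the paper's implementation: the toy word-pair table stands in for the WordNet and paraphrase database lookups, the averaging of row and column maxima is one plausible way to aggregate the matrix, and the function names and the equal weighting `alpha=0.5` are assumptions made only for illustration.

```python
import math
from collections import Counter

# Hypothetical word-pair scores; a real system would query WordNet
# or the paraphrase database instead of this toy table.
TOY_WORD_SIM = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("drives", "rides")): 0.6,
}

def word_sim(a, b):
    """Return a similarity in [0, 1] for a word pair (toy lookup)."""
    if a == b:
        return 1.0
    return TOY_WORD_SIM.get(frozenset((a, b)), 0.0)

def matrix_similarity(s1, s2):
    """Build the pairwise matrix, keep each row's and column's maximum,
    and average them (an assumed aggregation, not the paper's formula)."""
    if not s1 or not s2:
        return 0.0
    matrix = [[word_sim(a, b) for b in s2] for a in s1]
    row_max = [max(row) for row in matrix]
    col_max = [max(col) for col in zip(*matrix)]
    return (sum(row_max) + sum(col_max)) / (len(row_max) + len(col_max))

def cosine_similarity(s1, s2):
    """Cosine similarity between bag-of-words count vectors."""
    c1, c2 = Counter(s1), Counter(s2)
    dot = sum(c1[w] * c2[w] for w in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def combined_similarity(s1, s2, alpha=0.5):
    """Weighted mix of the two measures; alpha is an assumed parameter."""
    return alpha * matrix_similarity(s1, s2) + (1 - alpha) * cosine_similarity(s1, s2)

s1 = ["the", "car", "drives"]
s2 = ["the", "automobile", "rides"]
print(matrix_similarity(s1, s2), cosine_similarity(s1, s2), combined_similarity(s1, s2))
```

In this toy pair the sentences share only one surface word, so the vector measure alone is low, while the matrix measure credits the related word pairs. That is the kind of complementarity the combined measure is meant to exploit.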
