Abstract

Paraphrase identification is a semantic text similarity task that is an important part of many natural language processing applications. Existing methods use vector space models, word co-occurrence information, lexical databases, parsers and machine translation (MT) evaluation metrics to measure text similarity. However, other aspects such as negations, inverse relations and the semantic roles of the sentences are also important in identifying paraphrases. Furthermore, the semantics of a sentence are obscured when the sentence is complex. We propose an approach that measures the similarity between a pair of texts by considering all these factors. The approach determines the set of clauses present in the texts by resolving conjunctions in complex sentences, which exposes hidden triples in the text. It then extracts clause-based similarity features, namely concept score, relation score, proposition score and word score. We combine these similarity features with MT metric features and use a Support Vector Machine model to decide whether the texts are paraphrases. We evaluated the methodology on the Microsoft Research paraphrase corpus. Two statistical tests, the k-fold paired t-test and McNemar's test, show that including clause-based features significantly improves performance. Our approach also outperforms state-of-the-art methods in terms of accuracy, F1-measure and f1-measure.
