Abstract
Paraphrase identification is a semantic text similarity task that forms an important part of many natural language processing applications. Existing methods use vector space models, word co-occurrence information, lexical databases, parsers and machine translation (MT) evaluation metrics to measure text similarity. However, other aspects such as negations, inverse relations and the semantic roles within sentences are also important in identifying paraphrases. Furthermore, the semantics of a sentence are often hidden when the sentence is complex. We propose an approach that measures the similarity between a pair of texts by considering all of these factors. The approach determines the set of clauses present in the texts by resolving conjunctions in complex sentences, which exposes triples that are otherwise hidden in the text. It then extracts clause-based similarity features, namely concept score, relation score, proposition score and word score. We combine these similarity features with MT metric features and use a Support Vector Machine model to decide whether the texts are paraphrases. We evaluated our methodology on the Microsoft Research corpus. Two statistical tests, the $k$-fold paired $t$-test and McNemar's test, show that including the clause-based features significantly improves performance. Our approach also outperforms state-of-the-art methods in terms of accuracy, $F_1$-measure and $f_1$-measure.
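As an illustration of the kind of significance check mentioned above, the following is a minimal sketch of McNemar's test for comparing two classifiers on the same test pairs. It is not the authors' code: the function name `mcnemar_test`, the use of a continuity correction and the toy label lists are all assumptions made for this example.

```python
import math


def mcnemar_test(gold, pred_a, pred_b):
    """Compare two classifiers evaluated on the same examples.

    Hypothetical helper for illustration. Returns (chi2, p_value) using
    McNemar's test with continuity correction and 1 degree of freedom.
    """
    b = 0  # classifier A correct, classifier B wrong
    c = 0  # classifier A wrong, classifier B correct
    for g, a, p in zip(gold, pred_a, pred_b):
        a_ok = a == g
        b_ok = p == g
        if a_ok and not b_ok:
            b += 1
        elif b_ok and not a_ok:
            c += 1

    if b + c == 0:
        # The classifiers never disagree in correctness: no evidence of a difference.
        return 0.0, 1.0

    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy data (assumed): 35 paraphrase pairs, all labelled 1.
gold = [1] * 35
baseline = [1] * 10 + [0] * 25      # correct only on the first 10 pairs
with_clauses = [0] * 10 + [1] * 25  # correct only on the last 25 pairs

chi2, p_value = mcnemar_test(gold, baseline, with_clauses)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```

Only the pairs on which exactly one classifier is correct enter the statistic, which is why the test suits a comparison of a baseline feature set against the same model with additional features on a fixed test set.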