Abstract
In this paper, we propose a hybrid approach for sentence paraphrase identification. The proposal addresses the problem of evaluating sentence-to-sentence semantic similarity when the sentences contain named entities. The essence of the proposal is to compute the semantic similarity of named-entity tokens separately from that of the rest of the sentence text. More specifically, the approach integrates word semantic similarity derived from WordNet taxonomic relations with named-entity semantic relatedness inferred from Wikipedia entity co-occurrences and underpinned by Normalized Google Distance. In addition, the WordNet similarity measure is enriched with word part-of-speech (PoS) conversion aided by a Categorial Variation database (CatVar), which enhances the lexico-semantics of words. We validated our hybrid approach using two different datasets: the Microsoft Research Paraphrase Corpus (MSRPC) and TREC-9 Question Variants. In our empirical evaluation, we showed that our system outperforms baselines and most of the related state-of-the-art systems for paraphrase detection. We also conducted a misidentification analysis to identify the primary sources of our system's errors.
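To make the named-entity component more concrete, the following is a minimal sketch of the standard Normalized Google Distance formula applied to co-occurrence counts, such as counts of Wikipedia articles mentioning each entity. The function names, the mapping from distance to relatedness (1 - NGD, clipped at 0), and the token-count weighting used to combine the two scores are illustrative assumptions; the abstract does not specify how the system combines the word-level and entity-level scores.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Standard NGD from occurrence counts.

    fx, fy: number of documents mentioning entity x / entity y
    fxy:    number of documents mentioning both
    n:      total number of documents in the collection
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Entities that never co-occur are treated as maximally distant.
        return float("inf")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    denominator = math.log(n) - min(log_fx, log_fy)
    if denominator <= 0:
        # Degenerate case: the rarer entity occurs in every document.
        return 0.0
    return (max(log_fx, log_fy) - log_fxy) / denominator


def entity_relatedness(fx, fy, fxy, n):
    # Assumption: map distance to a [0, 1] relatedness score as 1 - NGD.
    return max(0.0, 1.0 - normalized_google_distance(fx, fy, fxy, n))


def hybrid_similarity(word_sim, entity_rel, n_word_tokens, n_entity_tokens):
    # Hypothetical combination: weight each score by its share of tokens.
    total = n_word_tokens + n_entity_tokens
    if total == 0:
        return 0.0
    return (word_sim * n_word_tokens + entity_rel * n_entity_tokens) / total
```

Here `word_sim` stands in for a WordNet-based similarity score computed elsewhere. The split into two functions mirrors the idea of scoring named-entity tokens separately from the remaining words.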
Summary
Paraphrases are sentences conveying the same meaning using alternative language expressions (Dias et al., 2010). Identifying paraphrases amounts to quantifying the semantic overlap between two textual expressions. Paraphrase Identification (PI) is useful for many other important NLP applications, including Text Summarization, Plagiarism Detection, Intelligent Tutoring Systems, Question Answering, and Machine Translation. Paraphrases can be used to substantiate the correctness of answers produced by a question answering application. Plagiarism detection can also benefit from PI, which identifies texts that have been restated using alternative language. In Intelligent Tutoring Systems, paraphrase identification can be used to assess whether students' submissions or answers are semantically equivalent to reference answers.