Abstract
Paraphrase detection and generation are important natural language processing (NLP) tasks. Yet the term paraphrase is broad enough to include many fine-grained relations. This leads to different tolerance levels of semantic divergence in the positive paraphrase class among publicly available paraphrase datasets. Such variation can affect the generalisability of paraphrase classification models. It may also impact the predictability of paraphrase generation models. This paper presents a new method to automatically construct corpora of fine-grained paraphrase relations using language inference models. The fine-grained sentence-level paraphrase relations are defined based on their word- and phrase-level counterparts. We demonstrate that the fine-grained labels from our proposed system make it possible to generate paraphrases at a desirable semantic level. The new labels could also contribute to general sentence embedding techniques.
Highlights
Paraphrase detection and generation are important natural language processing (NLP) tasks
After examining the size and sentence relations in various datasets, we focus the construction efforts on two language inference datasets: Multi-Genre Natural Language Inference (MNLI) and Stanford Natural Language Inference (SNLI), and three paraphrase datasets: Microsoft Research Paraphrase Corpus (MRPC) [1], Quora Question Pairs (QQP) [12] and the semantic textual similarity benchmark (STS-B) [13]
Since users are not expected to have seen all questions, the dataset is bound to contain a relatively high number of false negative samples. We find that both the MNLI and SNLI classifiers tend to make wrong predictions on sentences with an ambiguous pronoun reference
Summary
Paraphrase detection and generation are important natural language processing (NLP) tasks. A dataset with a stricter rule may label the second sentence pair as a negative case. Such variation can affect the generalisability of a paraphrase classification model. This paper proposes a novel method to automatically generate fine-grained paraphrase labels using language inference models. We developed a method utilising the language inference model to automatically assign fine-grained labels to sentence pairs in existing paraphrase and language inference corpora. We find that, compared with Quora Question Pairs (QQP), MRPC tolerates more semantic divergence in its positive class, which contains more directional paraphrases than equivalent ones, and that Multi-Genre Natural Language Inference (MNLI) contains more diversified sentence pairs in all three classes. Such information may help researchers to design customised optimisation and provide insights on observed performance variation
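The labelling step described above can be sketched in code. This is a minimal illustration, not the authors' exact implementation: it assumes an NLI classifier that returns one of entailment, neutral, or contradiction for an ordered sentence pair, and combines the two directional predictions into a fine-grained paraphrase relation. The relation names used here (equivalent, directional, contradictory, unrelated) are illustrative placeholders, not the paper's exact label set.

```python
# Sketch: derive a fine-grained paraphrase relation from bidirectional
# NLI predictions. An "equivalent" paraphrase entails in both directions;
# a "directional" one entails in only one direction.

def fine_grained_label(a_to_b: str, b_to_a: str) -> str:
    """Map NLI predictions for (A -> B, B -> A) to a paraphrase relation.

    Each argument is one of: "entailment", "neutral", "contradiction".
    """
    if a_to_b == "entailment" and b_to_a == "entailment":
        return "equivalent"      # mutual entailment: full paraphrase
    if a_to_b == "entailment" or b_to_a == "entailment":
        return "directional"     # one-way entailment: forward/backward paraphrase
    if "contradiction" in (a_to_b, b_to_a):
        return "contradictory"   # semantically incompatible pair
    return "unrelated"           # neutral in both directions
```

In practice, the two directional predictions could come from an off-the-shelf NLI model (for example, an MNLI-trained classifier queried once per direction), after which this mapping assigns the sentence-level label.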