Abstract

Paraphrase detection and generation are important natural language processing (NLP) tasks. Yet the term paraphrase is broad enough to cover many fine-grained relations, so publicly available paraphrase datasets tolerate different levels of semantic divergence in their positive paraphrase class. Such variation can affect the generalisability of paraphrase classification models and the predictability of paraphrase generation models. This paper presents a new method that uses language inference models to automatically construct corpora with fine-grained paraphrase relations. The fine-grained sentence-level paraphrase relations are defined based on their word- and phrase-level counterparts. We demonstrate that the fine-grained labels from our proposed system make it possible to generate paraphrases at a desired semantic level. The new labels could also contribute to general sentence embedding techniques.

Highlights

  • Paraphrase detection and generation are important natural language processing (NLP) tasks

  • After examining the size and sentence relations in various datasets, we focus the construction efforts on two language inference datasets: Multi-Genre Natural Language Inference (MNLI) and Stanford Natural Language Inference (SNLI), and three paraphrase datasets: the Microsoft Research Paraphrase Corpus (MRPC) [1], Quora Question Pairs (QQP) [12], and the Semantic Textual Similarity Benchmark (STS-B) [13]

  • Since users are not expected to have seen all questions, the dataset is bound to contain a relatively high number of false negative samples. We find that both the MNLI and SNLI classifiers tend to make wrong predictions on sentence pairs with an ambiguous pronoun reference


Summary

Introduction

Paraphrase detection and generation are important natural language processing (NLP) tasks. Because paraphrase datasets differ in how strictly they define a paraphrase, a dataset with a stricter rule may label as negative a sentence pair that another dataset accepts as positive, and such variation can affect the generalisability of a paraphrase classification model. This paper proposes a novel method to automatically generate fine-grained paraphrase labels using language inference models: we assign fine-grained labels to sentence pairs in existing paraphrase and language inference corpora. We find that, compared with Quora Question Pairs (QQP), MRPC tolerates more semantic divergence in its positive class, which contains more directional paraphrases than equivalent ones, and that Multi-Genre Natural Language Inference (MNLI) contains more diversified sentence pairs in all three classes. Such information may help researchers design customised optimisation and provide insight into observed performance variation.
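The relabelling idea above can be sketched with a simple rule: run a three-label language inference classifier in both directions over a sentence pair and map the two predictions to a fine-grained relation (mutual entailment for equivalent paraphrases, one-way entailment for directional ones). This is a minimal illustrative sketch; the label names, the rule mapping, and the `toy_nli` stand-in are assumptions for illustration, not the paper's exact rules or classifier.

```python
# Hypothetical sketch: derive a fine-grained sentence-pair relation from
# bidirectional NLI predictions. The `nli` callable is assumed to return
# one of "entailment", "neutral", "contradiction" (MNLI/SNLI-style).

def fine_grained_label(nli, s1, s2):
    """Map two directional NLI predictions to one relation label."""
    fwd = nli(s1, s2)  # does s1 entail s2?
    bwd = nli(s2, s1)  # does s2 entail s1?
    if fwd == "entailment" and bwd == "entailment":
        return "equivalent"           # mutual entailment: strict paraphrase
    if fwd == "entailment":
        return "forward_entailment"   # directional: s1 more specific than s2
    if bwd == "entailment":
        return "backward_entailment"  # directional: s2 more specific than s1
    if "contradiction" in (fwd, bwd):
        return "contradiction"
    return "neutral"

# Toy stand-in for a trained NLI classifier, for demonstration only.
def toy_nli(premise, hypothesis):
    table = {
        ("A man is jogging.", "A person is moving."): "entailment",
        ("A person is moving.", "A man is jogging."): "neutral",
    }
    return table.get((premise, hypothesis), "neutral")

print(fine_grained_label(toy_nli, "A man is jogging.", "A person is moving."))
```

In practice the `nli` argument would be a real classifier fine-tuned on MNLI or SNLI; the rule table here only shows how the two directional predictions combine.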

Related Work
Fine-Grained Paraphrase Relations
Observations from Language Inference Datasets
Auto Relabel Rules
Automatic Relabelling with Fine-Grained Paraphrase Relations
Three-Label Language Inference Classifiers and Initial Data Cleansing
Summary Statistics of Fine-Grained Labels
Fine-Grained Label Correctness and Accuracy Investigation
String Property Analysis
Generation Experiment
Experiment Models
Generator Results
Conclusions

