Abstract

Paraphrase detection is a Natural-Language Processing (NLP) task that aims to automatically identify whether two sentences convey the same meaning, even when they use different words. For the Portuguese language, most existing work models this task as a machine-learning solution, extracting features and training a classifier. In this paper, following a different line, we explore a graph structure representation and model the paraphrase identification task over a heterogeneous network. We also adopt a back-translation strategy for data augmentation to balance the dataset we use. Our approach, although simple, outperforms the best results reported for the paraphrase detection task in Portuguese, showing that graph structures may better capture the semantic relatedness among sentences.

Highlights

  • Paraphrase detection is a Natural-Language Processing (NLP) task that aims to automatically identify whether two sentences convey the same meaning

  • Smooth Inverse Frequency (SIF) [20] and weighted aggregation based on Inverse Document Frequency (IDF)

  • We detail the methods developed for paraphrase identification and our strategy to mitigate the imbalance of the ASSIN corpus
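
The IDF-based weighted aggregation mentioned in the highlights can be illustrated with a minimal sketch. It assumes each sentence is already tokenized and that word vectors are available as a plain dictionary. The function names, the smoothed IDF formula, and the toy vectors are illustrative assumptions, not the configuration used in the paper.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Compute a smoothed IDF per token: log((1 + N) / (1 + df)) + 1.

    This smoothing is one common variant, assumed here for illustration.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def idf_weighted_average(tokens, vectors, idf):
    """Aggregate word vectors into one sentence vector, weighting by IDF.

    Tokens without a vector or an IDF weight are skipped.
    `vectors` must be a non-empty dict of equal-length lists.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok in vectors and tok in idf:
            w = idf[tok]
            for i, x in enumerate(vectors[tok]):
                total[i] += w * x
            weight_sum += w
    if weight_sum == 0.0:
        return total  # no known tokens: return the zero vector
    return [x / weight_sum for x in total]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under this weighting, tokens that occur in fewer sentences contribute more to the sentence vector than tokens that occur in most of them, and two sentence vectors can then be compared with `cosine`.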


Summary

Introduction

Paraphrase detection is a Natural-Language Processing (NLP) task that aims to automatically identify whether two sentences convey the same meaning. Existing work on detecting paraphrases in Portuguese [3,10] models this task as a machine-learning solution, building feature-value tables and training and testing classifiers. These authors apply sampling techniques to mitigate the imbalance of the ASSIN corpus, aiming for more balanced data to improve the results of their models. Other strategies that make use of synthetic data have been criticized for the quality of the generated data. To fill these gaps and explore other approaches to paraphrase detection, in this paper, inspired by Sousa et al. [13], we model the paraphrase detection task as a heterogeneous network.

Related Work
The ASSIN Corpus
The MSRP Corpus
Balancing the ASSIN Corpus
Modeling the Paraphrase Identification Task
Formulating the Paraphrase Identification Task
Experiments and Results
Method
Final Remarks
