Corpus-Based Paraphrase Detection Experiments and Review

Tedo Vrbanec, Ana Meštrović

doi:10.3390/info11050241

Abstract

Paraphrase detection is important for a number of applications, including plagiarism detection, authorship attribution, question answering, text summarization, and text mining in general. In this paper, we give a performance overview of various types of corpus-based models, especially deep learning (DL) models, on the task of paraphrase detection. We report the results of eight models (LSI, TF-IDF, Word2Vec, Doc2Vec, GloVe, FastText, ELMO, and USE) evaluated on three different publicly available corpora: the Microsoft Research Paraphrase Corpus, Clough and Stevenson, and the Webis Crowd Paraphrase Corpus 2011. Through a great number of experiments, we decided on the most appropriate approaches for text pre-processing, hyper-parameters, sub-model selection where sub-models exist (e.g., Skipgram vs. CBOW), distance measures, and the semantic similarity/paraphrase detection threshold. Our findings and those of other researchers who have used deep learning models show that DL models are very competitive with traditional state-of-the-art approaches and have potential that should be further developed.
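As a rough illustration of how a corpus-based model, a distance measure, and a detection threshold fit together, the following minimal sketch scores a text pair with TF-IDF vectors and cosine similarity. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the function names, and the default threshold of 0.5 are all assumptions made for this example, and in practice the IDF weights would be estimated from a whole corpus rather than from the pair being compared.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed minimal pre-processing: lowercase and split on whitespace.
    return text.lower().split()


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict: term -> weight) per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    # Document frequency: iterating a Counter yields each term once per text.
    df = Counter(term for d in docs for term in d)
    # Smoothed IDF (an assumption; other variants exist).
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in d.items()} for d in docs]


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_paraphrase(text_a, text_b, threshold=0.5):
    # The threshold value here is a placeholder, not a value from the paper.
    u, v = tfidf_vectors([text_a, text_b])
    return cosine_similarity(u, v) >= threshold
```

The same decision rule applies to the embedding-based models: only the way the two vectors are produced changes, while the similarity measure and threshold comparison stay the same.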

Highlights

Paraphrasing is the process of rewriting text to change the form and expression while retaining its original meaning
In this research we compare eight models, namely LSI, term frequency-inverse document frequency (TF-IDF), Word2Vec, Doc2Vec, GloVe, FastText, embeddings from language models (ELMO) [8], and the universal sentence encoder (USE) [9]. We evaluated these models in terms of accuracy, precision, recall, and F1 measure on three different publicly available corpora: the Microsoft Research Paraphrase Corpus (MSRP), Clough and Stevenson (C&S), and the Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11)
Threshold values were calculated from the training parts of the datasets and evaluated on the testing parts. For Webis we used 3-fold cross-validation; for MSRP we used the predefined train and test datasets; and for C&S we used 5-fold cross-validation because the corpus is small and contains five topics
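The evaluation procedure described in these highlights can be sketched as follows: a threshold is chosen on the training similarity scores, then accuracy, precision, recall, and F1 are computed on the held-out part. This is only a sketch under stated assumptions. The source does not say which criterion was used to pick the threshold, so maximizing F1 is an assumption, as are the function names and the contiguous, unshuffled fold split.

```python
def classification_metrics(scores, labels, threshold):
    """Metrics for the rule 'score >= threshold means paraphrase'."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def best_threshold(train_scores, train_labels, metric="f1"):
    """Pick the observed score that maximizes `metric` on the training data.

    Assumption: the selection criterion (F1 by default) is illustrative only.
    """
    candidates = sorted(set(train_scores))
    return max(
        candidates,
        key=lambda t: classification_metrics(train_scores, train_labels, t)[metric],
    )


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size (no shuffling)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(scores, labels, k):
    """Fit the threshold on k-1 folds and evaluate on the remaining fold."""
    results = []
    for test_idx in k_fold_indices(len(scores), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(scores)) if i not in held_out]
        threshold = best_threshold(
            [scores[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        results.append(
            classification_metrics(
                [scores[i] for i in test_idx], [labels[i] for i in test_idx], threshold
            )
        )
    return results
```

For a corpus with a predefined split, such as MSRP, `best_threshold` would be called once on the training scores and `classification_metrics` once on the test scores; `cross_validate` with `k=3` or `k=5` corresponds to the Webis and C&S setups respectively.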

Summary

Introduction

Paraphrasing is the process of rewriting text to change the form and expression while retaining its original meaning. Automatic paraphrase detection plays an important role in various tasks, including plagiarism detection, authorship attribution, question answering, text summarization, and text mining in general. The somewhat more general task of measuring the semantic similarity of texts is significant in the domain of natural language processing (NLP). Although some of the existing paraphrase systems have performed quite well, certain challenges with paraphrase detection remain. Existing paraphrase systems deliver relatively good results for clean texts, but they do not perform well when applied to noisy texts [1,2,3]. In recent years, the application of deep neural network models to the NLP domain has expanded, opening up a completely new field for experimentation and improvement of the existing approaches.



Journal: Information
Publication Date: Apr 29, 2020
Citations: 12
License type: CC BY 4.0


