Paraphrase acquisition via crowdsourcing and machine learning

Steven Burrows, Martin Potthast, Benno Stein

doi:10.1145/2483669.2483676

Abstract

To paraphrase means to rewrite content while preserving the original meaning. Paraphrasing is important in fields such as text reuse in journalism, anonymizing work, and improving the quality of customer-written reviews. This article contributes to paraphrase acquisition and focuses on two aspects that are not addressed by current research: (1) acquisition via crowdsourcing, and (2) acquisition of passage-level samples. The challenge of the first aspect is automatic quality assurance; without such a means the crowdsourcing paradigm is not effective, and without crowdsourcing the creation of test corpora is unacceptably expensive for realistic orders of magnitude. The second aspect addresses the deficit that most of the previous work in generating and evaluating paraphrases has been conducted using sentence-level paraphrases or shorter; these short-sample analyses are limited in terms of application to plagiarism detection, for example. We present the Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11), which recently formed part of the PAN 2010 international plagiarism detection competition. This corpus, collected using Amazon's Mechanical Turk for crowdsourcing, comprises passage-level paraphrases with 4067 positive samples and 3792 negative samples that failed our criteria. In this article, we review the lessons learned at PAN 2010 and explain in detail the method used to construct the corpus. The empirical contributions include machine learning experiments to explore whether passage-level paraphrases can be identified in a two-class classification problem using paraphrase similarity features, and we find that a k-nearest-neighbor classifier can correctly distinguish between paraphrased and nonparaphrased samples with 0.980 precision at 0.523 recall.
This result implies that just under half of our samples must be discarded (the remaining fraction of 0.477), but our cost analysis shows that the automation we introduce results in an 18% financial saving and over 100 hours of time returned to the researchers when repeating a similar corpus design. On the other hand, when building an unrelated corpus requiring, say, 25% training data for the automated component, we show that the financial outcome is cost neutral, while still returning over 70 hours of time to the researchers. The work presented here is the first to join the paraphrasing and plagiarism communities.
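The relationship between the reported precision, recall, and discarded fraction can be illustrated with a small calculation. The sketch below is not the authors' evaluation code. It assumes, purely for illustration, that the reported rates apply to all 4067 positive samples; the function name `accepted_counts` is hypothetical.

```python
# Illustrative arithmetic only. Assumption: the reported precision and
# recall apply to the full set of positive samples in the corpus.
POSITIVE_SAMPLES = 4067  # positive samples, from the abstract
PRECISION = 0.980        # reported k-nearest-neighbor precision
RECALL = 0.523           # reported k-nearest-neighbor recall


def accepted_counts(positives, precision, recall):
    """Return (true_positives, false_positives) implied by the two rates."""
    true_positives = recall * positives
    # precision = TP / (TP + FP)  =>  FP = TP * (1 - precision) / precision
    false_positives = true_positives * (1.0 - precision) / precision
    return true_positives, false_positives


tp, fp = accepted_counts(POSITIVE_SAMPLES, PRECISION, RECALL)
missed_fraction = 1.0 - RECALL  # true paraphrases the classifier does not accept

print(f"true paraphrases accepted:   {tp:.0f}")
print(f"non-paraphrases accepted:    {fp:.0f}")
print(f"true paraphrases not accepted (fraction): {missed_fraction:.3f}")
```

Under this assumption, the fraction not accepted equals one minus the recall, which matches the 0.477 figure quoted above.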

Journal: ACM Transactions on Intelligent Systems and Technology
Publication Date: Jun 1, 2013
Citations: 72
