To paraphrase means to rewrite content while preserving the original meaning. Paraphrasing is important in fields such as text reuse in journalism, anonymizing work, and improving the quality of customer-written reviews. This article contributes to paraphrase acquisition and focuses on two aspects that are not addressed by current research: (1) acquisition via crowdsourcing, and (2) acquisition of passage-level samples. The challenge of the first aspect is automatic quality assurance; without such a means, the crowdsourcing paradigm is not effective, and without crowdsourcing, the creation of test corpora is unacceptably expensive at realistic orders of magnitude. The second aspect addresses the deficit that most previous work on generating and evaluating paraphrases has been conducted on sentence-level or shorter samples; analyses of such short samples are of limited use for applications such as plagiarism detection. We present the Webis Crowd Paraphrase Corpus 2011 (Webis-CPC-11), which recently formed part of the PAN 2010 international plagiarism detection competition. The corpus comprises passage-level paraphrases, with 4067 positive samples and 3792 negative samples that failed our criteria, collected via Amazon's Mechanical Turk. In this article, we review the lessons learned at PAN 2010 and explain in detail the method used to construct the corpus. Our empirical contributions include machine learning experiments that explore whether passage-level paraphrases can be identified in a two-class classification problem using paraphrase similarity features. We find that a k-nearest-neighbor classifier can correctly distinguish between paraphrased and nonparaphrased samples with 0.980 precision at 0.523 recall. At this operating point, just under half of our samples (a 0.477 fraction) must be discarded, but our cost analysis shows that the automation we introduce yields an 18% financial saving and returns over 100 hours of time to the researchers when a similar corpus design is repeated. Conversely, when building an unrelated corpus that requires, say, 25% of the data as training data for the automated component, we show that the financial outcome is cost neutral, while still returning over 70 hours of time to the researchers. The work presented here is the first to join the paraphrasing and plagiarism communities.
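As a rough illustration of the two-class setup summarized above, the sketch below trains a k-nearest-neighbor classifier on a handful of hypothetical similarity features computed between an original passage and a candidate paraphrase, and evaluates it along a precision-recall curve so that a high-precision operating point can be chosen. The feature set, the value of k, and the use of scikit-learn are assumptions made here for illustration; they are not the authors' actual feature set or pipeline.

```python
# Minimal sketch (not the corpus authors' pipeline): two-class paraphrase
# classification with k-NN over simple, hypothetical similarity features.
from difflib import SequenceMatcher
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_recall_curve

def similarity_features(original: str, candidate: str) -> list[float]:
    """Compute a few cheap similarity features between two passages."""
    a, b = set(original.lower().split()), set(candidate.lower().split())
    word_overlap = len(a & b) / max(len(a | b), 1)                 # Jaccard word overlap
    length_ratio = min(len(a), len(b)) / max(len(a), len(b), 1)    # vocabulary-size ratio
    char_sim = SequenceMatcher(None, original, candidate).ratio()  # character-level similarity
    return [word_overlap, length_ratio, char_sim]

def train_and_score(train_pairs, train_labels, test_pairs, test_labels, k=5):
    """train_pairs/test_pairs: (original, candidate) tuples; labels: 1 = paraphrase, 0 = not."""
    X_train = [similarity_features(o, c) for o, c in train_pairs]
    X_test = [similarity_features(o, c) for o, c in test_pairs]
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, train_labels)
    # Score with class probabilities so a high-precision threshold
    # (e.g. ~0.98 precision at reduced recall) can be read off the curve.
    scores = clf.predict_proba(X_test)[:, 1]
    return precision_recall_curve(test_labels, scores)
```

In such a setup, the threshold chosen on the precision-recall curve directly determines how many crowdsourced samples are retained automatically versus discarded, which is the trade-off the cost analysis in the abstract refers to.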