Paraphrase plagiarism identification with character-level features

Fernando Sánchez-Vega,Luis Villaseñor-Pineda,Efstathios Stamatatos,Paolo Rosso,Esaú Villatoro-Tello,Manuel Montes-Y-Gómez

doi:10.1007/s10044-017-0674-z

Abstract

Several methods have been proposed for determining plagiarism between pairs of sentences, passages or even full documents. However, the majority of these methods fail to reliably detect paraphrase plagiarism due to the high complexity of the task, even for human beings. Paraphrase plagiarism identification consists in automatically recognizing document fragments that contain reused text, which is intentionally hidden by means of some rewording practices such as semantic equivalences, discursive changes and morphological or lexical substitutions. Our main hypothesis establishes that the original author’s writing style fingerprint prevails in the plagiarized text even when paraphrases occur. Thus, in this paper we propose a novel text representation scheme that gathers both content and style characteristics of texts, represented by means of character-level features. As an additional contribution, we describe the methodology followed for the construction of an appropriate corpus for the task of paraphrase plagiarism identification, which represents a new valuable resource to the NLP community for future research work in this field.

Full Text