Abstract
This paper explores the detection of derivation links between texts (also called plagiarism, near-duplication, or revision) at the document level. We evaluate textual elements that implement the ideas of specificity and invariance, both separately and in combination, to characterize derivatives. To run this evaluation, we built a French press corpus based on Wikinews revisions. We obtain performance similar to that of the state-of-the-art method (n-gram overlap) while reducing the signature size and therefore the processing costs. To ensure the verifiability and reproducibility of our results, we make our code and our corpus available to the community.
Highlights
In the information age, information is produced, duplicated, revised, and to some extent plagiarized.
We address the task of detecting text derivatives of a given source document among a collection of suspicious documents: given a collection of suspicious documents and a collection of source documents, each suspicious document must be mapped to its source by detecting the derivation links between them.
We provide a derivation corpus with revision relations for press texts, a concrete contribution to the scientific community since no resource was available for studying derivation in French.
Summary
In the information age, information is produced, duplicated, revised, and to some extent plagiarized. This redundancy is a hindrance to Information Retrieval (IR) methods in terms of computation, storage, and results. We address the task of detecting text derivatives of a given source document among a collection of suspicious documents: given a collection of suspicious documents and a collection of source documents, each suspicious document must be mapped to its source by detecting the derivation links between them. This task is usually handled by measuring the n-gram overlap between source and suspicious documents. We compare the performance of our proposals to this baseline.
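To make the n-gram overlap baseline concrete, the sketch below shows one common way such a measure can be computed. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the word-level trigrams, the whitespace tokenization, the containment score, the 0.5 threshold, and all function names are hypothetical choices made for this example.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams of a text (naive whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspicious, source, n=3):
    """Share of the suspicious document's n-grams that also occur in the source."""
    susp = word_ngrams(suspicious, n)
    if not susp:
        return 0.0
    return len(susp & word_ngrams(source, n)) / len(susp)


def link_derivatives(suspicious_docs, source_docs, n=3, threshold=0.5):
    """Map each suspicious document id to its best-scoring source id.

    A document is mapped to None when no source reaches the threshold
    (an assumed cut-off, chosen only for illustration).
    """
    links = {}
    for susp_id, susp_text in suspicious_docs.items():
        best_id, best_score = None, 0.0
        for src_id, src_text in source_docs.items():
            score = containment(susp_text, src_text, n)
            if score > best_score:
                best_id, best_score = src_id, score
        links[susp_id] = best_id if best_score >= threshold else None
    return links


sources = {
    "s1": "the cat sat on the mat today",
    "s2": "stock markets fell sharply on monday morning",
}
suspicious = {
    "d1": "the cat sat on the mat yesterday",
    "d2": "a completely unrelated sentence about cooking pasta",
}
print(link_derivatives(suspicious, sources))
```

In this toy run, `d1` shares four of its five trigrams with `s1` and is linked to it, while `d2` shares none with any source and stays unlinked. The signature here is the full n-gram set of each document, which is the cost that the smaller signatures discussed in the abstract aim to reduce.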