Text Reuse Research Articles

We have built a suite of Python tools to analyze text reuse and intertextuality efficiently in a specific group of medieval Arabic texts (commentaries) that are available in print. We take these printed editions, scan them, pre-process the images, pass them to an OCR engine, clean the results, store them in a data structure that mirrors the explicit intertextual relations among the texts, and then perform data analysis on them. Digital approaches to medieval Arabic texts have so far worked either at the micro-level, in what has become known as a ‘digital edition’, i.e. the digital representation of one text, densely annotated, most commonly in TEI-XML, or at the macro-level, in what is called a ‘digital corpus’, consisting of thousands of loosely encoded and sparsely annotated plain text files, accompanied by an entire infrastructure and high-performing software for broadly scoped queries. The micro-level generally involves tens of thousands of words, while the macro-level can exceed a billion words. The micro-level is explicitly designed to be human readable first, while the macro-level is built to be machine readable first. At the micro-level, every detail needs to be correct and in order, while at the macro-level even a fairly large margin of error is negligible, a mere rounding error. Between these levels we have been seeking a meso-level of digital analysis: neither edition nor corpus, but rather a group of texts of hundreds of thousands to millions of words, with a small but perceptible margin of error and a light but noticeable level of annotation, principally geared towards machine readability but with ample opportunity for visual inspection and manual correction. In this paper we explain the rationale for our approach, the technical achievements it has led to, and the results we have obtained so far.

Text similarity analysis entails studying identical and closely similar text passages across large corpora, with a particular focus on intentional and unintentional borrowing patterns. At a larger scale, detecting repeated passages takes on added importance, as the same text can convey different meanings in different contexts. This approach offers numerous benefits, enhancing intellectual and literary scholarship by simplifying the identification of textual overlaps. Consequently, scholars can focus on the theoretical aspects of reception with an expanded corpus of evidence at their disposal. This article adds to the expanding field of historical text reuse, applying it to intellectual history and showcasing its utility in examining reception, influence, popularity, authorship attribution, and the development of tools for critical editions. Focused on the works and various editions of Bernard Mandeville (1670–1733), the research applies comparative text similarity analysis to explore his borrowing habits and the reception of his works. Systematically examining text reuses across several editions of Mandeville’s works, it provides insights into the evolution of his output and influences over time. The article adopts a forward-looking perspective in historical research, advocating for the integration of archival and statistical evidence. This is illustrated through a detailed examination of the attribution of Publick Stews to Mandeville. Analysing cumulative negative evidence of borrowing patterns suggests that Mandeville might not have been the author of the piece. However, the article aims not to conclude the debate but rather to open it up, underscoring the importance of taking such evidence into consideration. Additionally, it encourages scholars to incorporate text reuse evidence when exploring other cases in early modern scholarship. This highlights the adaptability and scalability of text similarity analysis as a valuable tool for advancing literary studies and intellectual history.

Articles published on Text Reuse

Mono-lingual text reuse detection for the Urdu language at lexical level

Minha Terra Tem ____________: Patterns of Text Reuse in “Song of Exile” and its Intertexts

Crediting Invisible Work: Congress and the Lawmaking Productivity Metric (LawProM)

Sypung på rymmen

Neither Corpus Nor Edition: Building a Pipeline to Make Data Analysis Possible on Medieval Arabic Commentary Traditions

Textuality as amplification: reconsidering close reading and distant reading in cultural history

Not just rubber-stamping: understanding the amending role of the Chinese legislature with bill text reuse

Transparent generosity. Introducing the impresso interface for the exploration of semantically enriched historical newspapers

Documentary Formulae as Text Reuse Templates: Constat and Manifestus Clauses in Early Medieval Latin Charters

A Comparative text similarity analysis of the works of Bernard Mandeville

Urdu Text Reuse Detection at Phrasal level using Sentence Transformer-based approach

Impresso Text Reuse at Scale. An interface for the exploration of text reuse data in semantically enriched historical newspapers.

Content Generation in the Age of Mechanical Reproduction

Cross-lingual Text Reuse Detection at Document Level for English-Urdu Language Pair

Detecting translation borrowings in huge text collections using various methods

Detecting the influence of the Chinese guiding cases: a text reuse approach

Developing an Urdu Lemmatizer Using a Dictionary-Based Lookup Approach

Reception Reader: Exploring Text Reuse in Early Modern British Publications

A large dataset of scientific text reuse in Open-Access publications

Textual Migration Across the Baltic Sea
