A Corpus for Automatic Readability Assessment and Text Simplification of German

Abstract

In this paper, we present a corpus for use in automatic readability assessment and automatic text simplification for German, the first of its kind for this language. The corpus is compiled from web sources and consists of parallel as well as monolingual-only (simplified German) data amounting to approximately 6,200 documents (nearly 211,000 sentences). As a unique feature, the corpus contains information on text structure (e.g., paragraphs, lines), typography (e.g., font type, font style), and images (content, position, and dimensions). While the importance of considering such information in machine learning tasks involving simplified language, such as readability assessment, has repeatedly been stressed in the literature, we provide empirical evidence for its benefit. We also demonstrate the added value of leveraging monolingual-only data for automatic text simplification via machine translation through applying back-translation, a data augmentation technique.
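The back-translation step mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names and `toy_reverse_model` are hypothetical, and the toy model is only a placeholder for a trained system that translates simplified German back into standard German. The idea shown is that each authentic simplified sentence is paired with a synthetic source sentence, and these pairs are added to the genuine parallel data.

```python
from typing import Callable, List, Tuple

SentencePair = Tuple[str, str]  # (standard German, simplified German)


def back_translate(
    simple_sentences: List[str],
    simple_to_standard: Callable[[str], str],
) -> List[SentencePair]:
    """Pair each authentic simplified sentence with a synthetic source."""
    synthetic_pairs = []
    for target in simple_sentences:
        # The source side is machine-generated and may be noisy;
        # the target side remains authentic simplified German.
        source = simple_to_standard(target)
        synthetic_pairs.append((source, target))
    return synthetic_pairs


def toy_reverse_model(sentence: str) -> str:
    # Hypothetical placeholder for a trained simplified-to-standard model.
    return "Es ist festzuhalten, dass " + sentence[0].lower() + sentence[1:]


# Genuine parallel data (illustrative example pair, not taken from the corpus).
parallel = [
    ("Der Antrag ist fristgerecht einzureichen.",
     "Sie müssen den Antrag pünktlich abgeben."),
]
# Monolingual-only simplified data.
monolingual_simple = ["Der Hund schläft."]

# Augmented training set: genuine pairs plus synthetic pairs.
augmented = parallel + back_translate(monolingual_simple, toy_reverse_model)
```

In practice the reverse model would first be trained on the parallel portion of the data, and the forward (standard-to-simplified) model would then be trained on the augmented set.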

Similar Papers
  • Conference Article
  • Cited by 8
  • 10.26615/978-954-452-056-4_131
Automated Text Simplification as a Preprocessing Step for Machine Translation into an Under-resourced Language
  • Oct 22, 2019
  • Sanja Štajner + 1 more

In this work, we investigate the possibility of using a fully automatic text simplification system on the English source in machine translation (MT) for improving its translation into an under-resourced language. We use the state-of-the-art automatic text simplification (ATS) system for lexically and syntactically simplifying source sentences, which are then translated with two state-of-the-art English-to-Serbian MT systems, the phrase-based MT (PBMT) and the neural MT (NMT). We explore three different scenarios for using the ATS in MT: (1) using the raw output of the ATS; (2) automatically filtering out the sentences with low grammaticality and meaning preservation scores; and (3) performing a minimal manual correction of the ATS output. Our results show improvement in fluency of the translation regardless of the chosen scenario, and a difference in the success of the three scenarios depending on the MT approach used (PBMT or NMT) with regard to improving translation fluency and post-editing effort.

  • Research Article
  • Cited by 106
  • 10.1145/2738046
Making It Simplext
  • May 11, 2015
  • ACM Transactions on Accessible Computing
  • Horacio Saggion + 5 more

The way in which a text is written can be a barrier for many people. Automatic text simplification is a natural language processing technology that, when mature, could be used to produce texts that are adapted to the specific needs of particular users. Most research in the area of automatic text simplification has dealt with the English language. In this article, we present results from the Simplext project, which is dedicated to automatic text simplification for Spanish. We present a modular system with dedicated procedures for syntactic and lexical simplification that are grounded on the analysis of a corpus manually simplified for people with special needs. We carried out an automatic evaluation of the system’s output, taking into account the interaction between three different modules dedicated to different simplification aspects. One evaluation is based on readability metrics for Spanish and shows that the system is able to reduce the lexical and syntactic complexity of the texts. We also show, by means of a human evaluation, that sentence meaning is preserved in most cases. Our results, even if our work represents the first automatic text simplification system for Spanish that addresses different linguistic aspects, are comparable to the state of the art in English Automatic Text Simplification.

  • Conference Article
  • Cited by 112
  • 10.18653/v1/w18-0535
OneStopEnglish corpus: A new corpus for automatic readability assessment and text simplification
  • Jan 1, 2018
  • Sowmya Vajjala + 1 more

This paper describes the collection and compilation of the OneStopEnglish corpus of texts written at three reading levels, and demonstrates its usefulness through two applications: automatic readability assessment and automatic text simplification. The corpus consists of 189 texts, each in three versions (567 in total). The corpus is now freely available under a CC BY-SA 4.0 license and we hope that it will foster further research on the topics of readability assessment and text simplification.

  • Research Article
  • Cited by 4
  • 10.1162/tacl_a_00653
Do Text Simplification Systems Preserve Meaning? A Human Evaluation via Reading Comprehension
  • Apr 16, 2024
  • Transactions of the Association for Computational Linguistics
  • Sweta Agrawal + 1 more

Automatic text simplification (TS) aims to automate the process of rewriting text to make it easier for people to read. A pre-requisite for TS to be useful is that it should convey information that is consistent with the meaning of the original text. However, current TS evaluation protocols assess system outputs for simplicity and meaning preservation without regard for the document context in which output sentences occur and for how people understand them. In this work, we introduce a human evaluation framework to assess whether simplified texts preserve meaning using reading comprehension questions. With this framework, we conduct a thorough human evaluation of texts by humans and by nine automatic systems. Supervised systems that leverage pre-training knowledge achieve the highest scores on the reading comprehension tasks among the automatic controllable TS systems. However, even the best-performing supervised system struggles with at least 14% of the questions, marking them as “unanswerable” based on simplified content. We further investigate how existing TS evaluation metrics and automatic question-answering systems approximate the human judgments we obtained.

  • Book Chapter
  • 10.1007/978-3-031-02166-4_2
Readability and Text Simplification
  • Jan 1, 2017
  • Synthesis lectures on human language technologies
  • Horacio Saggion

A key question in text simplification research is the identification of the complexity of a given text so that a decision can be made on whether or not to simplify it. Identifying the complexity of a text or sentence can help assess whether the output produced by a text simplification system matches the reading ability of the target reader. It can also be used to compare different systems in terms of complexity or simplicity of the produced output. There are a number of very complete surveys on the relevant topic of text readability, which can be understood as “what makes some texts easier to read than others” [Benjamin, 2012; Collins-Thompson, 2014; DuBay, 2004]. Text readability, which has been investigated for a long time in academic circles, is very close to the “to simplify or not to simplify” question in automatic text simplification. Text readability research has often attempted to devise mechanical methods to assess the reading difficulty of a text so that it can be objectively measured. Classical mechanical text readability formulas combine a number of proxies to obtain a numerical score indicative of the difficulty of a text. These scores could be used to place the texts in an appropriate grade level or used to sort text by difficulty.
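As an illustration of how a classical formula combines proxies into one score, the sketch below computes the Flesch Reading Ease score for English, which uses average sentence length and average syllables per word. The formula constants are the standard ones; the syllable counter and the tokenisation are crude assumptions made for this example, not part of the formula or of the chapter.

```python
import re


def count_syllables(word: str) -> int:
    # Crude heuristic (an assumption): one syllable per run of vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Higher scores indicate text that is easier to read."""
    # Naive sentence and word splitting, sufficient for a sketch.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


easy = "The cat sat. The dog ran."
hard = ("Institutional accountability necessitates comprehensive "
        "organizational transparency.")
easy_score = flesch_reading_ease(easy)
hard_score = flesch_reading_ease(hard)
```

Sorting texts by such a score, or mapping score ranges to grade levels, corresponds to the uses described above. A production system would need a proper syllabifier and language-specific constants.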

  • Conference Article
  • Cited by 1
  • 10.1109/bip56202.2022.10032482
Towards Text Simplification in Spanish: A Brief Overview of Deep Learning Approaches for Text Simplification
  • Nov 15, 2022
  • Mario Romero + 5 more

Text simplification refers to the transformation of a specific source text into a target text aiming to increase understanding and readability for one or more specific audiences. This task demands large human efforts and specialized knowledge, which makes the usage of automated or semi-automated computational approaches appealing. The rise of deep learning as a unifying paradigm between seemingly different fields such as image analysis, sound processing and natural language processing has considerably influenced the current state-of-the-art approaches for automatic text simplification. Therefore, in this work, we focus on the study of deep learning-based state-of-the-art methods for automatic text simplification in the Spanish language. To this end, we first disentangle the different tasks which can be addressed in order to yield a simplified text in general. We then review the latest deep learning-based approaches, along with the main datasets and performance metrics used in the field. We also describe approaches to deal with small datasets and technical words. Finally, we describe some lessons for building accurate automatic text simplification systems in Spanish, as in this language there is a noticeable shortage of work for text simplification.

  • Research Article
  • Cited by 7
  • 10.3389/frai.2023.1223924
MeaningBERT: assessing meaning preservation between sentences
  • Sep 22, 2023
  • Frontiers in Artificial Intelligence
  • David Beauchemin + 2 more

In the field of automatic text simplification, assessing whether or not the meaning of the original text has been preserved during simplification is of paramount importance. Metrics relying on n-gram overlap assessment may struggle to deal with simplifications which replace complex phrases with their simpler paraphrases. Current evaluation metrics for meaning preservation based on large language models (LLMs), such as BertScore in machine translation or QuestEval in summarization, have been proposed. However, none has a strong correlation with human judgment of meaning preservation. Moreover, such metrics have not been assessed in the context of text simplification research. In this study, we present a meta-evaluation of several metrics we apply to measure content similarity in text simplification. We also show that the metrics are unable to pass two trivial, inexpensive content preservation tests. Another contribution of this study is MeaningBERT (https://github.com/GRAAL-Research/MeaningBERT), a new trainable metric designed to assess meaning preservation between two sentences in text simplification, showing how it correlates with human judgment. To demonstrate its quality and versatility, we will also present a compilation of datasets used to assess meaning preservation and benchmark our study against a large selection of popular metrics.

  • Video Transcripts
  • 10.48448/19gd-3934
Portuguese Neural Text Simplification using Machine Translation
  • Nov 16, 2021
  • Underline Science Inc.
  • Rafael Mello + 5 more

Automatic Text Simplification (ATS) has played a significant role in the Natural Language Processing (NLP) field. ATS is a sequence-to-sequence problem aiming to create a new version of the original text removing complex and domain-specific words. It can improve communication and understanding of documents from specific domains, as well as support second language learning. This paper presents an empirical study on the use of state-of-the-art ATS methods to simplify texts in Portuguese. It is important to remark that the literature reports the challenge in analyzing Portuguese texts due to the lack of resources compared to other languages (e.g., English). More specifically, this work evaluated different Neural Machine Translation (NMT) techniques for ATS in Portuguese. The experiments showed that NMT achieved promising results in Portuguese texts, obtaining a BLEU score of 40.89 using multiple parallel corpora and raising the overall readability score by more than 5 points.

  • Research Article
  • Cited by 298
  • 10.1111/j.1540-4781.2007.00507.x
A Linguistic Analysis of Simplified and Authentic Texts
  • Feb 16, 2007
  • The Modern Language Journal
  • Scott A Crossley + 3 more

The opinions of second language learning (L2) theorists and researchers are divided over whether to use authentic or simplified reading texts as the means of input for beginning‐ and intermediate‐level L2 learners. Advocates of both approaches cite the use of linguistic features, syntax, and discourse structures as important elements in support of their arguments, but there has been no conclusive study that measures these differences and their implications for L2 learning. The purpose of this article is to provide an exploratory study that fills this gap. Using the computational tool Coh‐Metrix, this study investigates the differences between the linguistic structures of sampled simplified texts and those of authentic reading texts in order to provide a better understanding of the linguistic features that comprise these text types. The findings demonstrate that these texts differ significantly, but not always in the manner supposed by the authors of relevant scholarship. This research is meant to enable material developers, publishers, and classroom teachers to judge more accurately the value of both authentic and simplified texts.

  • Conference Article
  • Cited by 23
  • 10.18653/v1/w18-7005
Reference-less Quality Estimation of Text Simplification Systems
  • Jan 1, 2018
  • Louis Martin + 5 more

The evaluation of text simplification (TS) systems remains an open challenge. As the task has common points with machine translation (MT), TS is often evaluated using MT metrics such as BLEU. However, such metrics require high quality reference data, which is rarely available for TS. TS has the advantage over MT of being a monolingual task, which allows for direct comparisons to be made between the simplified text and its original version. In this paper, we compare multiple approaches to reference-less quality estimation of sentence-level text simplification systems, based on the dataset used for the QATS 2016 shared task. We distinguish three different dimensions: grammaticality, meaning preservation and simplicity. We show that n-gram-based MT metrics such as BLEU and METEOR correlate the most with human judgment of grammaticality and meaning preservation, whereas simplicity is best evaluated by basic length-based metrics.

  • Conference Article
  • Cited by 6
  • 10.1145/3663548.3675645
Design and Evaluation of an Automatic Text Simplification Prototype with Deaf and Hard-of-hearing Readers
  • Oct 27, 2024
  • Oliver Alonzo + 5 more

Research has observed benefits from providing lexical and syntactic approaches to Automatic Text Simplification (ATS) to Deaf and Hard-of-hearing (DHH) readers. However, little research has explored DHH readers’ design preferences and interactions with these approaches. This work first explores the design space of ATS systems with DHH readers, identifying potential design configurations for evaluation. Open-ended discussion of participants’ design preferences reveals values informing those preferences, including maintaining reading fluency and efficiency, and control over the tool. Using popular design choices from our formative study, we evaluated a prototype that provides various simplification types to explore DHH readers’ interactions with the system. We observed potential conflicts between participants’ values and design preferences, such as the prototype’s impact on participants’ reading speed and participants’ perceived need to reread simplifications suggested by the tool. However, participants found the tool useful, showing a nuanced preference towards word-level lexical simplifications using pop-ups. Our findings highlight the importance of the tool’s design on users’ reading experiences, and provide implications for the design and evaluation of ATS prototypes with target readers.

  • Research Article
  • Cited by 209
  • 10.1075/itl.165.2.06sid
A survey of research on text simplification
  • Dec 31, 2014
  • ITL - International Journal of Applied Linguistics
  • Advaith Siddharthan

Text simplification, defined narrowly, is the process of reducing the linguistic complexity of a text, while still retaining the original information and meaning. More broadly, text simplification encompasses other operations; for example, conceptual simplification to simplify content as well as form, elaborative modification, where redundancy and explicitness are used to emphasise key points, and text summarisation to omit peripheral or inappropriate information. There is substantial evidence that manual text simplification is an effective intervention for many readers, but automatic simplification has only recently become an established research field. There have been several recent papers on the topic, however, which bring to the table a multitude of methodologies, each with their strengths and weaknesses. The goal of this paper is to summarise the large interdisciplinary body of work on text simplification and highlight the most promising research directions to move the field forward.

  • Book Chapter
  • Cited by 1
  • 10.1007/978-3-031-35320-8_5
A Review of Parallel Corpora for Automatic Text Simplification. Key Challenges Moving Forward
  • Jan 1, 2023
  • Tania Josephine Martin + 2 more

This review of parallel corpora for automatic text simplification (ATS) involves an analysis of forty-nine papers wherein the corpora are presented, focusing on corpora in the Indo-European languages of Western Europe. We improve on recent corpora reviews by reporting on the target audience of the ATS, the language and domain of the source text, and other metadata for each corpus, such as alignment level, annotation strategy, and the transformation applied to the simplified text. The key findings of the review are: 1) the lack of resources that address ATS aimed at domains which are important for social inclusion, such as health and public administration; 2) the lack of resources aimed at audiences with mild cognitive impairment; 3) the scarcity of experiments where the target audience was directly involved in the development of the corpus; 4) more than half the proposals do not include any extra annotation, thereby lacking detail on how the simplification was done, or the linguistic phenomenon tackled by the simplification; 5) other types of annotation, such as the type and frequency of the transformation applied, could identify the most frequent simplification strategies; and 6) future strategies to advance the field of ATS could leverage automatic procedures to make the annotation process more agile and efficient.

  • Research Article
  • Cited by 7
  • 10.1093/applin/amac057
The Effect of Automatic Text Simplification on L2 Readers’ Text Comprehension
  • Oct 19, 2022
  • Applied Linguistics
  • Dennis Murphy Odo

Texts used in L2 classrooms have traditionally been simplified manually, but recent technological advances allow us to investigate whether automatic text simplification (ATS) software can help L2 learners comprehend texts in second and foreign languages. Participants were divided into low and high L2 reading proficiency groups and assigned to read either the authentic or automatically simplified version of a text and completed a free recall task and MC comprehension test. The results did not show any significant correlations among the variables of topic knowledge, topic interest, and MC comprehension, but there were correlations among L2 reading comprehension, MC comprehension, and free recall results. Results also showed that the automatically simplified text facilitated the comprehension of the more proficient readers but not the less proficient readers according to their performance on the free recall assessment. Implications are that L2 teachers cannot blindly use whatever text they want with ATS, and ATS software designers may need to reconsider the current conservative approach to simplification that many ATS tools use.

  • Research Article
  • 10.1007/s10579-025-09879-4
A comparative study of sentence alignment methods for Spanish text simplification
  • Mar 3, 2026
  • Language Resources and Evaluation
  • Christina Niklaus + 3 more

Millions of people worldwide face barriers in accessing and understanding complex written information due to limited literacy. Automatic text simplification (ATS) addresses this challenge by transforming complex texts into simpler, more accessible versions. However, most existing ATS research focuses on English, leaving Spanish, a language spoken by over 500 million people, underrepresented. This paper fills this gap by introducing large-scale sentence-aligned simplification resources for Spanish, developed from the Newsela and ClearSim corpora. We propose detailed guidelines for manual alignment, evaluate a wide range of automatic sentence alignment algorithms, and present the first systematic exploration of LLM-based monolingual sentence alignment in Spanish. Our analysis incorporates comprehensive quantitative and qualitative evaluation, supported by statistical significance testing, and reveals clear differences in the structural simplification patterns across corpora. In addition, we train and release baseline ATS models using the new aligned datasets, demonstrating their practical utility for downstream simplification. All alignment code, trained models, and evaluation scripts will be publicly released to ensure transparency and reproducibility. Together, these contributions substantially advance the resources and methodology for Spanish-language ATS.
