Plain Italian and AI: Strengths and weaknesses of automatic linguistic simplification
Abstract The simplification of language – particularly with regard to administrative discourse – has long been a central concern within Italian linguistics. Over the past few decades, significant progress has been made, including the development of consolidated and widely accepted lists of linguistic features – both morphosyntactic and lexical – that influence textual simplicity and accessibility (cf. Fiorentino/Ganfi 2024). These advances contributed to the early creation of a readability index, the Gulpease index, in the 1980s (cf. Lucisano/Piemontese 1988). Within this framework, the authors have developed SEMPL-IT, a software tool for the automatic simplification of administrative texts supported by QWEN3, a large language model (LLM) (cf. Russodivito et al. 2024; Fiorentino/Russodivito 2025; Ganfi/Russodivito 2025; Fiorentino et al. forthcoming; Fiorentino/Russodivito forthcoming). As part of this project, a corpus named ItaIst (Fiorentino et al. 2024b; publicly available on Hugging Face at https://huggingface.co/datasets/VerbACxSS/ItaIst, 15 July 2025) was compiled and subjected to automatic simplification using the BASIC approach, resulting in a parallel corpus of simplified texts. This simplified corpus was then compared to the source corpus and evaluated in terms of improved readability and semantic similarity (cf. Chandrasekaran et al. 2021), with the objective of validating the effectiveness of the simplification process. In this contribution, we introduce and validate a new methodology – the CHAIN approach – applied to a different corpus, ItaRegol (Fiorentino et al. 2024a; publicly available on Hugging Face at https://huggingface.co/datasets/VerbACxSS/ItaRegol, 15 July 2025). Although smaller in size than ItaIst, ItaRegol comprises rules and regulations, i.e., legally binding texts that create, modify, or extinguish subjective legal positions.
Due to the legal nature of these texts, simplification must be carried out with caution to avoid altering their legal effects. This paper compares the two simplification approaches – BASIC and CHAIN – by evaluating the parameters adopted, assessing the quality of the simplified output, and drawing conclusions regarding the differing impact of these strategies in enhancing the readability of administrative versus regulatory texts.
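The Gulpease index cited in the abstract follows a closed formula, 89 + (300 × sentences − 10 × letters) / words, where higher values indicate easier text. The Python sketch below is for orientation only: the function name `gulpease` and its tokenization and sentence-splitting rules are simplifying assumptions of this example, not the procedure used by SEMPL-IT or in the authors' evaluation.

```python
import re


def gulpease(text: str) -> float:
    """Return the Gulpease readability index of an Italian text.

    Formula: 89 + (300 * sentences - 10 * letters) / words.
    Assumptions of this sketch: a word is any run of Unicode letters
    (so "l'ufficio" counts as two words), and a sentence ends at one
    or more of the characters '.', '!' or '?'.
    """
    # Runs of letters only: digits, underscores and punctuation are excluded.
    words = re.findall(r"[^\W\d_]+", text)
    if not words:
        raise ValueError("text contains no words")
    letters = sum(len(word) for word in words)
    # Drop empty fragments produced by trailing punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return 89 + (300 * len(sentences) - 10 * letters) / len(words)


# Compare an original sentence with a simplified rewrite (made-up examples).
original = "L'istanza deve essere presentata entro il termine perentorio indicato."
simplified = "Devi presentare la domanda entro la data indicata."
print(round(gulpease(original), 1), round(gulpease(simplified), 1))
```

Abbreviations such as "i.e." or "art." would be counted as sentence boundaries by this naive splitter, so a real pipeline would need a proper sentence segmenter.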
- Research Article
2
- 10.1145/3744744
- Jun 19, 2025
- ACM Transactions on Intelligent Systems and Technology
Sentence simplification, which rewrites a sentence to be easier to read and understand, is a promising technique to help people with various reading difficulties. With the rise of advanced large language models (LLMs), evaluating their performance in sentence simplification has become imperative. Recent studies have used both automatic metrics and human evaluations to assess the simplification abilities of LLMs. However, the suitability of existing evaluation methodologies for LLMs remains in question. First, the suitability of current automatic metrics on LLMs’ simplification evaluation is still uncertain. Second, current human evaluation approaches in sentence simplification often fall into two extremes: they are either too superficial, failing to offer a clear understanding of the models’ performance, or overly detailed, making the annotation process complex and prone to inconsistency, which in turn affects the evaluation’s reliability. To address these problems, this study provides in-depth insights into LLMs’ performance while ensuring the reliability of the evaluation. We design an error-based human annotation framework to assess the LLMs’ simplification capabilities. We select both closed-source and open-source LLMs, including GPT-4, Qwen2.5-72B, and Llama-3.2-3B. We believe that these models offer a representative selection across large, medium, and small sizes of LLMs. Results show that LLMs generally generate fewer erroneous simplification outputs compared to the previous state-of-the-art. However, LLMs have their limitations, as seen in GPT-4’s and Qwen2.5-72B’s struggle with lexical paraphrasing. Furthermore, we conduct meta-evaluations on widely used automatic metrics using our human annotations.
We find that these metrics lack sufficient sensitivity to assess the overall high-quality simplifications, particularly those generated by high-performance LLMs.
- Research Article
106
- 10.1145/2738046
- May 11, 2015
- ACM Transactions on Accessible Computing
The way in which a text is written can be a barrier for many people. Automatic text simplification is a natural language processing technology that, when mature, could be used to produce texts that are adapted to the specific needs of particular users. Most research in the area of automatic text simplification has dealt with the English language. In this article, we present results from the Simplext project, which is dedicated to automatic text simplification for Spanish. We present a modular system with dedicated procedures for syntactic and lexical simplification that are grounded on the analysis of a corpus manually simplified for people with special needs. We carried out an automatic evaluation of the system’s output, taking into account the interaction between three different modules dedicated to different simplification aspects. One evaluation is based on readability metrics for Spanish and shows that the system is able to reduce the lexical and syntactic complexity of the texts. We also show, by means of a human evaluation, that sentence meaning is preserved in most cases. Our results, even if our work represents the first automatic text simplification system for Spanish that addresses different linguistic aspects, are comparable to the state of the art in English Automatic Text Simplification.
- Research Article
2
- 10.1200/jco.2024.42.16_suppl.e13609
- Jun 1, 2024
- Journal of Clinical Oncology
Background: Precision oncology revolutionized cancer treatment by identifying molecular biomarkers to guide personalized care. The ever-growing body of medical literature presents a challenge for oncologists researching targeted therapies. While recent studies investigated large language models (LLMs) to streamline this process, LLM reliance on general rather than medical knowledge limits clinical relevance and trustworthiness. To address these limitations, we developed a retrieval-augmented generation (RAG) system that integrates PubMed clinical studies, trial databases and oncological guidelines with LLMs to support targeted treatment recommendations. The Molecular Tumor Board (MTB) at the Center of Personalized Medicine (ZPMTUM) guided and evaluated treatment options proposed by the LLM to assess their applicability for clinical decision support. Methods: We used 10 publicly accessible fictional patient cases with 7 tumor types and 59 distinct molecular alterations. Our LLM system MEREDITH (Medical Evidence Retrieval and Data Integration for Tailored Healthcare) consists of Google's Gemini Pro, enhanced with RAG and Chain-of-Thought (CoT) prompting. To establish a benchmark, clinical experts at ZPMTUM manually annotated the cases. Informed by MTB expert feedback, we iteratively improved our LLM system from a draft system relying on PubMed-indexed data to an enhanced system, which replicated expert annotation processes by incorporating oncology guidelines, drug availability and trial databases (ClinicalTrials.gov, QuickQueck.de). ZPMTUM assessed credibility and clinical relevance of manually annotated and LLM-generated recommendations. Patient-level data on (likely) pathogenic molecular alterations and recommended treatment options were summarized using median and interquartile range (IQR). Semantic similarity between LLM and clinician responses was assessed using cosine similarity of text vector embeddings; a paired t-test evaluated significance.
Results: The median of (likely) pathogenic molecular alterations per patient was 2.5 (IQR: 2-3). ZPMTUM identified a median of 2 treatment options per patient (IQR: 1-3), while the enhanced LLM identified a median of 4 (IQR: 3-5). MEREDITH proposed multiple relevant treatment suggestions, including therapies based on preclinical studies and molecular interactions, for further assessment by the MTB. ZPMTUM prioritized the most suitable clinical option. The mean semantic textual similarity of LLM responses increased significantly from 0.69 in the draft system to 0.76 in the enhanced system (p < 0.001). Thus, feedback from ZPMTUM enhanced the model's ability to align its responses with clinician thought processes. Conclusions: Leveraging expert thought processes to instruct LLMs holds promise as a novel decision support tool for precision oncology.
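The semantic similarity figures reported in this abstract are cosine similarities between text embeddings. As a rough illustration, the sketch below computes cosine similarity for two vectors in plain Python. The function name and the toy vectors are hypothetical; the abstract does not say which embedding model produced the real vectors, so none is assumed here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional "embeddings" (hypothetical values, for illustration only).
llm_answer = [0.12, 0.80, 0.35, 0.05]
clinician_answer = [0.10, 0.75, 0.40, 0.20]
print(cosine_similarity(llm_answer, clinician_answer))
```

A score near 1 means the two vectors point in almost the same direction; how well that reflects agreement in meaning depends on the embedding model.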
- Conference Article
4
- 10.1145/3726302.3730304
- Jul 13, 2025
The general public often encounters complex texts but does not have the time or expertise to fully understand them, leading to the spread of misinformation. Automatic Text Simplification (ATS) helps make information more accessible, but its evaluation methods have not kept up with advances in text generation, especially with Large Language Models (LLMs). In particular, recent studies have shown that current ATS metrics do not correlate with the presence of errors. Manual inspections have further revealed a variety of errors, underscoring the need for a more nuanced evaluation framework, which is currently lacking. This resource paper addresses this gap by introducing a test collection for detecting and classifying errors in simplified texts. First, we propose a taxonomy of errors, with a formal focus on information distortion. Next, we introduce a parallel dataset of automatically simplified scientific texts. This dataset has been human-annotated with labels based on our proposed taxonomy. Finally, we analyze the quality of the dataset, and we study the performance of existing models to detect and classify errors from that taxonomy. These contributions give researchers the tools to better evaluate errors in ATS, develop more reliable models, and ultimately improve the quality of automatically simplified texts.
- Research Article
- 10.1111/exd.70175
- Nov 1, 2025
- Experimental dermatology
Large language models (LLMs) have been explored in various dermato-oncological conditions. In this study, we aimed to compare different LLMs' potential to guide clinicians on the treatment of basal cell carcinoma (BCC). Four authors formulated 24 questions on the topic of clinical management of BCC. The blinded responses of three LLMs (Gemini, Copilot and ChatGPT 4.0) were presented to a panel of nine dermato-oncologists for assessment of (i) factual accuracy, (ii) concision, (iii) comprehensiveness and (iv) overall preference. In addition, the responses were then quantitatively compared based on lexical (i.e., vocabulary) and semantic (i.e., meaning) similarity to three additional LLMs (ChatGPT 3.5, ChatGPT 4o and Claude). ChatGPT 4.0 had the highest accuracy rate (87.5%, i.e., 21/24 responses), followed by Gemini (50%) and Copilot (25%). All models scored lower for concision and comprehensiveness, with ChatGPT 4.0 in the lead (62.5% comprehensive; 54.2% concise), followed by Gemini (33.3%; 12.5%) and Copilot (16.7%; 8.3%). The panel achieved consensus on model preference in 16 questions (ChatGPT 4.0: 54.2%; Gemini: 8.3%; Copilot: 4.2%; no consensus: 33.3%). While the lexical similarity was found to be low (x̄ ~0.07-0.10 across models), the semantic similarity between the LLM responses was moderate (x̄ ~0.60-0.70 across models). LLMs may assist clinicians in settings where expert dermato-oncological guidance is not readily available, with ChatGPT 4.0 currently outperforming both Gemini and Copilot. Since quantitative methods are unable to detect clinically relevant differences between LLMs, surveying dermatologists is necessary to identify useful models in this rapidly developing field.
- Research Article
7
- 10.3389/frai.2023.1223924
- Sep 22, 2023
- Frontiers in Artificial Intelligence
In the field of automatic text simplification, assessing whether or not the meaning of the original text has been preserved during simplification is of paramount importance. Metrics relying on n-gram overlap assessment may struggle to deal with simplifications which replace complex phrases with their simpler paraphrases. Current evaluation metrics for meaning preservation based on large language models (LLMs), such as BertScore in machine translation or QuestEval in summarization, have been proposed. However, none has a strong correlation with human judgment of meaning preservation. Moreover, such metrics have not been assessed in the context of text simplification research. In this study, we present a meta-evaluation of several metrics we apply to measure content similarity in text simplification. We also show that the metrics are unable to pass two trivial, inexpensive content preservation tests. Another contribution of this study is MeaningBERT (https://github.com/GRAAL-Research/MeaningBERT), a new trainable metric designed to assess meaning preservation between two sentences in text simplification, showing how it correlates with human judgment. To demonstrate its quality and versatility, we will also present a compilation of datasets used to assess meaning preservation and benchmark our study against a large selection of popular metrics.
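The claim that n-gram overlap penalizes valid paraphrases can be illustrated with a toy measure. The sketch below uses Jaccard overlap of n-gram sets, chosen here for brevity. It is not one of the metrics evaluated in the study, and the sentences and function names are made up for this example.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(source: str, simplified: str, n: int = 1) -> float:
    """Jaccard overlap between the n-gram sets of two sentences.

    Tokenization is lowercase whitespace splitting (an assumption of
    this sketch). Returns 0.0 when neither sentence has any n-gram.
    """
    a = ngrams(source.lower().split(), n)
    b = ngrams(simplified.lower().split(), n)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


source = "the physician administered the medication"
simplified = "the doctor gave the medicine"
print(ngram_jaccard(source, simplified, n=1))  # only "the" is shared
print(ngram_jaccard(source, simplified, n=2))  # no bigram is shared
```

The two sentences mean nearly the same thing, but they share only the word "the" and no bigram at all, so a purely surface-based score rates the simplification as almost unrelated to its source.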
- Conference Article
2
- 10.5167/uzh-192839
- May 16, 2020
In this paper, we present a corpus for use in automatic readability assessment and automatic text simplification for German, the first of its kind for this language. The corpus is compiled from web sources and consists of parallel as well as monolingual-only (simplified German) data amounting to approximately 6,200 documents (nearly 211,000 sentences). As a unique feature, the corpus contains information on text structure (e.g., paragraphs, lines), typography (e.g., font type, font style), and images (content, position, and dimensions). While the importance of considering such information in machine learning tasks involving simplified language, such as readability assessment, has repeatedly been stressed in the literature, we provide empirical evidence for its benefit. We also demonstrate the added value of leveraging monolingual-only data for automatic text simplification via machine translation through applying back-translation, a data augmentation technique.
- Research Article
40
- 10.1200/po-24-00478
- Oct 1, 2024
- JCO precision oncology
Rapidly expanding medical literature challenges oncologists seeking targeted cancer therapies. General-purpose large language models (LLMs) lack domain-specific knowledge, limiting their clinical utility. This study introduces the LLM system Medical Evidence Retrieval and Data Integration for Tailored Healthcare (MEREDITH), designed to support treatment recommendations in precision oncology. Built on Google's Gemini Pro LLM, MEREDITH uses retrieval-augmented generation and chain of thought. We evaluated MEREDITH on 10 publicly available fictional oncology cases with iterative feedback from a molecular tumor board (MTB) at a major German cancer center. Initially limited to PubMed-indexed literature (draft system), MEREDITH was enhanced to incorporate clinical studies on drug response within the specific tumor type, trial databases, drug approval status, and oncologic guidelines. The MTB provided a benchmark with manually curated treatment recommendations and assessed the clinical relevance of LLM-generated options (qualitative assessment). We measured semantic cosine similarity between LLM suggestions and clinician responses (quantitative assessment). MEREDITH identified a broader range of treatment options (median 4) compared with MTB experts (median 2). These options included therapies on the basis of preclinical data and combination treatments, expanding the treatment possibilities for consideration by the MTB. This broader approach was achieved by incorporating a curated medical data set that contextualized molecular targetability. Mirroring the approach MTB experts use to evaluate MTB cases improved the LLM's ability to generate relevant suggestions. This is supported by high concordance between LLM suggestions and expert recommendations (94.7% for the enhanced system) and a significant increase in semantic similarity from the draft to the enhanced system (from 0.71 to 0.76, P = .01). Expert feedback and domain-specific data augment LLM performance. 
Future research should investigate responsible LLM integration into real-world clinical workflows.
- Conference Article
3
- 10.1109/iaeac50856.2021.9390937
- Mar 12, 2021
This paper introduces a Chinese automatic text simplification (ATS) method based on unsupervised learning. Automatic text simplification is a research field of natural language processing. For Chinese texts, relying on a hand-made simplified corpus or dictionary is impractical because of the large number of texts. Chinese is a diverse language, and numerous factors need to be taken into consideration. The paper proposes an automatic simplification method for Chinese text and a readability formula based on linear regression. Given a set of Chinese sentences as input, our method produces more comprehensible sentences through syntactic and lexical simplification. In an automatic evaluation on the hand-made simplified corpus, the readability score of our system's output increased by 3.68 compared with that of the original text, and the SARI score reached 36.02.
- Research Article
106
- 10.1038/s41746-024-01024-9
- Feb 19, 2024
- NPJ Digital Medicine
Large language models (LLMs) have been shown to have significant potential in few-shot learning across various fields, even with minimal training data. However, their ability to generalize to unseen tasks in more complex fields, such as biology and medicine, has yet to be fully evaluated. LLMs can offer a promising alternative approach for biological inference, particularly in cases where structured data and sample size are limited, by extracting prior knowledge from text corpora. Here we report our proposed few-shot learning approach, which uses LLMs to predict the synergy of drug pairs in rare tissues that lack structured data and features. Our experiments, which involved seven rare tissues from different cancer types, demonstrate that the LLM-based prediction model achieves significant accuracy with very few or zero samples. Our proposed model, CancerGPT (with ~124M parameters), is comparable to the larger fine-tuned GPT-3 model (with ~175B parameters). Our research contributes to tackling drug pair synergy prediction in rare tissues with limited data and to advancing the use of LLMs for biological and medical inference tasks.
- Conference Article
16
- 10.1145/3661167.3661172
- Jun 18, 2024
Context: Systematic review (SR) is a popular research method in software engineering (SE). However, conducting an SR takes an average of 67 weeks. Thus, automating any step of the SR process could reduce the effort associated with SRs. Objective: Our objective is to investigate the extent to which Large Language Models (LLMs) can accelerate title-abstract screening by (1) simplifying abstracts for human screeners, and (2) automating title-abstract screening entirely. Method: We performed an experiment where human screeners performed title-abstract screening for 20 papers with both original and simplified abstracts from a prior SR. The experiment with human screeners was reproduced by instructing GPT-3.5 and GPT-4 LLMs to perform the same screening tasks. We also studied whether different prompting techniques (Zero-shot (ZS), One-shot (OS), Few-shot (FS), and Few-shot with Chain-of-Thought (FS-CoT) prompting) improve the screening performance of LLMs. Lastly, we studied whether redesigning the prompt used in the LLM reproduction of title-abstract screening leads to improved screening performance. Results: Text simplification did not increase the screeners’ screening performance, but reduced the time used in screening. Screeners’ scientific literacy skills and researcher status predict screening performance. Some LLM and prompt combinations perform as well as human screeners in the screening tasks. Our results indicate that a more recent LLM (GPT-4) is better than its predecessor LLM (GPT-3.5). Additionally, Few-shot and One-shot prompting outperform Zero-shot prompting. Conclusion: Using LLMs for text simplification in the screening process does not significantly improve human performance. Using LLMs to automate title-abstract screening seems promising, but current LLMs are not significantly more accurate than human screeners. To recommend the use of LLMs in the screening process of SRs, more research is needed.
We recommend that future SR studies publish replication packages with screening data to enable more conclusive experimenting with LLM screening.
- Research Article
- 10.2298/csis230912017m
- Jan 1, 2024
- Computer Science and Information Systems
The task of Automatic Text Simplification (ATS) aims to transform texts to improve their readability and comprehensibility. Current solutions are based on Large Language Models (LLMs). These models have high performance but require powerful computing resources and large amounts of data to be fine-tuned when working in specific and technical domains. This prevents most researchers from adapting the models to their area of study. The main contributions of this research are as follows: (1) proposing an accurate solution when powerful resources are not available, using the transfer learning capabilities across different domains with a set of linguistic features and a reduced-size pre-trained language model (T5-small), and making it accessible to a broader range of researchers and individuals; (2) the evaluation of our model on two well-known datasets, TurkCorpus and ASSET, and the analysis of the influence of control tokens on the SimpleText corpus, focusing on the domains of Computer Science and Medicine. Finally, a detailed discussion comparing our approach with state-of-the-art models for sentence simplification is included.
- Research Article
- 10.62408/ai-ling.v2i2.18
- Oct 20, 2025
- AI-Linguistica. Linguistic Studies on AI-Generated Texts and Discourses
This research aims to describe the performance of ChatGPT-3.5 and ChatGPT-4o in the task of Automatic Text Simplification (ATS) in Italian institutional texts. The aim is to analyse the linguistic differences between the original texts compared to their simplified rewritings by ChatGPT, and the impact of these differences on non-expert users’ experience. A dataset of six short texts was compiled to be rewritten using a zero-shot instructional prompt. The methodological approach combined quantitative linguistic analyses, manual analysis and human judgment to assess the effectiveness of the simplification. For the quantitative linguistic analysis, an additional comparison was made between ChatGPT’s rewritings and human revisions, used as an external benchmark to better contextualize the AI’s simplification strategies. The study provides new insights into the linguistic structure of administrative-bureaucratic texts by examining readability parameters and collecting subjective assessments of comprehension and perceived comprehensibility. It also aims to contribute to the growing body of research on text simplification methods and the role of large language models (LLMs) in enhancing accessibility to complex institutional discourse.
- Conference Article
1
- 10.1109/bip56202.2022.10032482
- Nov 15, 2022
Text simplification refers to the transformation of a specific source text into a target text, with the aim of increasing understanding and readability for one or more specific audiences. This task demands considerable human effort and specialized knowledge, which makes the use of automated or semi-automated computational approaches appealing. The rise of deep learning as a unifying paradigm across seemingly different fields, such as image analysis, sound processing and natural language processing, has considerably influenced the current state-of-the-art approaches for automatic text simplification. Therefore, in this work, we focus on the study of deep learning-based state-of-the-art methods for automatic text simplification in the Spanish language. To this end, we first disentangle the different tasks that can be addressed in order to yield a simplified text in general. We then review the latest deep learning-based approaches, along with the main datasets and performance metrics used in the field. We also describe approaches to deal with small datasets and technical words. Finally, we describe some lessons for building accurate automatic text simplification systems in Spanish, as there is a noticeable shortage of work on text simplification in this language.
- Research Article
37
- 10.2196/64290
- Feb 13, 2025
- Journal of medical Internet research
Laypeople have easy access to health information through large language models (LLMs), such as ChatGPT, and search engines, such as Google. Search engines transformed health information access, and LLMs offer a new avenue for answering laypeople's questions. We aimed to compare the frequency of use and attitudes toward LLMs and search engines as well as their comparative relevance, usefulness, ease of use, and trustworthiness in responding to health queries. We conducted a screening survey to compare the demographics of LLM users and nonusers seeking health information, analyzing results with logistic regression. LLM users from the screening survey were invited to a follow-up survey to report the types of health information they sought. We compared the frequency of use of LLMs and search engines using ANOVA and Tukey post hoc tests. Lastly, paired-sample Wilcoxon tests compared LLMs and search engines on perceived usefulness, ease of use, trustworthiness, feelings, bias, and anthropomorphism. In total, 2002 US participants recruited on Prolific participated in the screening survey about the use of LLMs and search engines. Of them, 52% (n=1045) of the participants were female, with a mean age of 39 (SD 13) years. Participants were 9.7% (n=194) Asian, 12.1% (n=242) Black, 73.3% (n=1467) White, 1.1% (n=22) Hispanic, and 3.8% (n=77) were of other races and ethnicities. Further, 1913 (95.6%) used search engines to look up health queries versus 642 (32.6%) for LLMs. Men had higher odds (odds ratio [OR] 1.63, 95% CI 1.34-1.99; P<.001) of using LLMs for health questions than women. Black (OR 1.90, 95% CI 1.42-2.54; P<.001) and Asian (OR 1.66, 95% CI 1.19-2.30; P<.01) individuals had higher odds than White individuals. Those with excellent perceived health (OR 1.46, 95% CI 1.1-1.93; P=.01) were more likely to use LLMs than those with good health. Higher technical proficiency increased the likelihood of LLM use (OR 1.26, 95% CI 1.14-1.39; P<.001). 
In a follow-up survey of 281 LLM users for health, most participants used search engines first (n=174, 62%) to answer health questions, but the second most common first source consulted was LLMs (n=39, 14%). LLMs were perceived as less useful (P<.01) and less relevant (P=.07), but elicited fewer negative feelings (P<.001), appeared more human (LLM: n=160, vs search: n=32), and were seen as less biased (P<.001). Trust (P=.56) and ease of use (P=.27) showed no differences. Search engines are the primary source of health information; yet, positive perceptions of LLMs suggest growing use. Future work could explore whether LLM trust and usefulness are enhanced by supplementing answers with external references and limiting persuasive language to curb overreliance. Collaboration with health organizations can help improve the quality of LLMs' health output.