Abstract

In this paper, we present an overview of existing parallel corpora for Automatic Text Simplification (ATS) in different languages, focusing on the approach adopted for their construction. We draw the main distinction between manual and (semi-)automatic approaches in order to investigate in which respects complex and simple texts vary, and whether and how the observed modifications may depend on the underlying approach. To this end, we perform a two-level comparison on Italian corpora, since Italian is the only language, with the exception of English, for which there are large parallel resources derived through both of the approaches considered. The first level of comparison accounts for the main types of sentence transformations occurring in the simplification process; the second examines the results of a linguistic profiling analysis based on Natural Language Processing techniques and carried out on the original and the simple versions of the same texts. For both levels of analysis, we focus our discussion mostly on sentence transformations and linguistic characteristics that pertain to the morpho-syntactic and syntactic structure of the sentence.

Highlights

  • Automatic Text Simplification (ATS) is the Natural Language Processing (NLP) task aimed at reducing linguistic complexity of texts, especially at the lexical and syntactic levels, while preserving their original content (Bott and Saggion, 2014; Shardlow, 2014; Alva-Manchego et al., 2020a)

  • We present an overview of existing parallel corpora for Automatic Text Simplification (ATS) in different languages focusing on the approach adopted for their construction

  • While in the previous section we focused on the comparison between the manual and automatic approaches with respect to the distribution of simplification rules, here we examine the distribution of a wide set of linguistic phenomena characterizing the complex and simple sentences of each corpus


Summary

INTRODUCTION

Automatic Text Simplification (ATS) is the Natural Language Processing (NLP) task aimed at reducing linguistic complexity of texts, especially at the lexical and syntactic levels, while preserving their original content (Bott and Saggion, 2014; Shardlow, 2014; Alva-Manchego et al., 2020a). It has long attracted the attention of different research communities that address the issue of generating a simplified version of an input text from two broad perspectives. It is worth noting that, with only a few exceptions (see Section 2 for details), these corpora are smaller than the ones available for English, which has made it hardly feasible to use them as training data for pure ATS systems based on machine learning methods. For this reason, similar resources were primarily collected to be used as reference corpora to identify the most frequent simplification operations occurring in manually simplified texts, or to train rule-based systems covering limited sets of simplification phenomena, as in the case of, e.g., Italian (Barlacchi and Tonelli, 2013), Basque (Aranzabe et al., 2013), French (Brouwers et al., 2014), and German (Suter et al., 2016). This poses several issues related to the quality of the training data and, as a consequence, of the resulting automatically simplified sentences.

Our Contribution
Manual Approach
AN OVERVIEW OF THE MAIN
A TWO-LEVEL COMPARISON
Distribution of Simplification
Distribution of Linguistic Phenomena
Findings
CONCLUSION
