This paper presents the Portuguese dataset of the iRead4Skills project (Dataset 1: corpora by complexity level for FR, PT, and SP – v.2.0), a representative sample of written European Portuguese for automatic complexity assessment that addresses a gap in existing resources for Portuguese. The corpus was created within the framework of the iRead4Skills project, which encompasses Portuguese, French, and Spanish. The project aims to develop an intelligent system that evaluates text complexity and recommends appropriate reading materials to native adult learners with low literacy skills. The corpus compilation involved a manual selection of text samples across various textual genres and document types, covering a wide range of existing written materials and focusing on the reading needs and reading habits of the target audience: low-literacy adults enrolled in vocational education and training centres or adult learning (AL) centres. The collected texts were categorised into the three levels of complexity targeted and defined by the project: very easy, easy, and plain. Texts of higher complexity were also included, resulting in four distinct sub-corpora. The resulting Portuguese dataset consists of 2,186 texts and 942,818 tokens and serves as the foundational source for training and testing the project’s complexity analysis systems. The paper gives a comprehensive overview of the corpus compilation process, including its methodological design and the challenges faced. Although some existing Portuguese corpora have been used for complexity studies and tool development, they consist primarily of texts classified according to CEFR levels and retrieved from didactic materials designed for L2 teaching/learning, or of texts produced by L2 learners. The corpus presented here is a new resource that addresses a significant gap in the materials needed to inform and support studies and applications related to text complexity.
The resulting dataset provides a novel and important language resource for European Portuguese, with several applications including research on linguistic complexity, development of automatic text complexity and readability assessment systems, and educational purposes.