Abstract We describe the first broad results of the linguistic profiling of the Common European Framework of Reference (CEFR)-levelled English Corpus (CLEC), a corpus built for Natural Language Processing purposes. The CLEC is a proficiency-levelled English corpus that covers CEFR levels A1, A2, B1, B2, and C1 and that has been built to train statistical models for automatic proficiency assessment. We describe the main aspects of the corpus development and present the linguistic features and the statistical results, obtained automatically, for written examples at levels A2, B1, and B2. We show how raw-text, lexical, morphosyntactic, and syntactic statistics can help to identify levels of proficiency, to test whether teaching materials are accurately classified by proficiency, to provide computational support for validating the proficiency level of new texts, and to specify level boundaries. In particular, texts at upper levels show higher values of lexical and syntactic complexity. This analysis supports the use of automatic tools that identify proficiency level from lexical and syntactic data, while morphosyntactic features reinforce the distinctions between competence levels. Finally, we suggest that these results are a first step towards the automatic CEFR-levelled assessment of new texts.