Abstract

Text readability is a measure of how easy or difficult a text is to read. It plays a crucial role in both drafting and comprehending texts, and it affects the choice of suitable reading material. Studies of text readability began in the late nineteenth century and have produced many practical applications. However, these studies have mainly addressed English and other widely used languages. For Vietnamese, text readability remains relatively unexplored and has received attention only in recent years, as part of efforts to improve the curriculum and teaching methods. Recent studies on the readability of Vietnamese text are still limited, largely because of the lack of text resources, namely corpora graded by difficulty level. Therefore, in this study, we focus on building a corpus for assessing the readability of Vietnamese texts in the literature domain through a process of collecting, processing and evaluating documents. The result is a corpus of 1,825 Vietnamese texts divided into four levels of difficulty (Very easy, Easy, Medium and Difficult). Experiments with existing Vietnamese readability assessment methods show that the corpus is reliable and usable for further research on text readability.

Highlights

  • Reading is one of the fundamental skills through which people all over the world acquire knowledge.

  • The article is organized as follows: Section 2 states the criteria for building the corpus; Section 3 presents the process of building a corpus for Vietnamese readability assessment, along with basic statistics and some experiments; Section 4 gives deeper statistics and analysis of the corpus; Section 5 presents our experiments on the constructed corpus to check its reliability; Section 6 concludes the study.

  • We used a machine learning method to evaluate the constructed corpus. This method is based on the study of Tanaka-Ishii et al. [32], which used Support Vector Machines (SVM) to build a model that compares the readability of pairs of texts based on word frequency features.
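
The pairwise idea in the last highlight can be illustrated with a minimal sketch. It is not the authors' implementation or that of Tanaka-Ishii et al. [32]: the vocabulary, the toy English texts, and the function names are hypothetical, and a small Pegasos-style linear SVM (stochastic subgradient descent on the hinge loss, without a bias term) stands in for a library SVM so that the example needs only the Python standard library. Each ordered pair of texts is represented by the difference of their word frequency vectors, and the label says whether the first text is the harder one.

```python
import random
from collections import Counter
from itertools import product


def word_freq_features(text, vocab):
    """Relative frequency of each vocabulary word in a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[word] / total for word in vocab]


def pair_vector(text_a, text_b, vocab):
    """Feature difference for the ordered pair (text_a, text_b)."""
    fa = word_freq_features(text_a, vocab)
    fb = word_freq_features(text_b, vocab)
    return [x - y for x, y in zip(fa, fb)]


def train_linear_svm(X, y, epochs=50, lam=0.01, seed=0):
    """Pegasos-style subgradient descent on the hinge loss (linear, no bias)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - 1.0 / t) * wj for wj in w]  # regularization shrink
            if margin < 1.0:  # hinge loss is active for this pair
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def compare(text_a, text_b, w, vocab):
    """Return +1 if text_a is predicted to be harder than text_b, else -1."""
    diff = pair_vector(text_a, text_b, vocab)
    score = sum(wj * xj for wj, xj in zip(w, diff))
    return 1 if score > 0 else -1


# Hypothetical toy data: two "easy" and two "hard" texts over a tiny vocabulary.
vocab = ["cat", "dog", "paradigm", "ontology"]
easy = ["cat dog cat", "dog cat dog dog"]
hard = ["paradigm ontology paradigm", "ontology paradigm cat ontology"]

# Every (hard, easy) pair is used in both orders, with opposite labels.
X, y = [], []
for h, e in product(hard, easy):
    X.append(pair_vector(h, e, vocab))
    y.append(+1)
    X.append(pair_vector(e, h, vocab))
    y.append(-1)

w = train_linear_svm(X, y)
print(compare("ontology paradigm", "dog dog cat", w, vocab))
```

Training on feature differences means the model never needs an absolute difficulty score; a graded corpus only has to supply pairs of texts whose relative order is known, for example two texts taken from different difficulty levels.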

Summary

Introduction

Reading is one of the fundamental skills through which people all over the world acquire knowledge. Dell’Orletta et al. [6] examined a corpus for readability features at both the text and the sentence level. Their corpus was built from two sources: (1) a newspaper, La Repubblica; and (2) an easy-to-read newspaper, Due Parole. For Vietnamese, the authors of [8] examined a set of texts to develop the first formula for readability assessment. In recent studies on Vietnamese text readability, Luong et al. [10, 11, 12] and Diep et al. [13] analyzed around 380 texts collected from school textbooks to study the effect of text length and some specific Vietnamese language features on readability. The article is organized as follows: Section 2 states the criteria for building the corpus; Section 3 presents the process of building a corpus for Vietnamese readability assessment, along with basic statistics and some experiments; Section 4 gives deeper statistics and analysis of the corpus; Section 5 presents our experiments on the constructed corpus to check its reliability; Section 6 concludes the study.

Criteria for building the corpus
Corpus building
Pre-processing
Expert evaluation
Very easy
Difficult
Reliability testing
Conflicts of Interest
Findings
Conclusions