Abstract
Text readability is a measure of how easy or difficult a text is to read. It plays a crucial role in drafting and comprehending texts and affects the choice of suitable reading material. Studies of text readability date back to the late nineteenth century and have produced many practical applications. However, these studies have mainly been carried out for English and other widely used languages. For Vietnamese, text readability remains relatively unexplored and has received attention only in recent years, as part of efforts to improve curricula and teaching methods. Recent studies of Vietnamese text readability are still limited, largely because of the lack of text resources, namely corpora graded by difficulty level. In this study, we therefore focus on building a corpus for assessing the readability of Vietnamese texts in the literature domain by collecting, processing and evaluating documents. The resulting corpus contains 1,825 Vietnamese texts divided into four levels of difficulty (Very easy, Easy, Medium and Difficult). Experiments with existing Vietnamese readability assessment methods show that the corpus is reliable and usable for further research on text readability.
Highlights
Reading is one of the fundamental skills through which people all over the world acquire knowledge.
The article is organized as follows: Section 2 states the criteria for building the corpus; Section 3 presents the process of building a corpus for Vietnamese readability assessment, along with basic statistics and some experiments; Section 4 gives deeper statistics and analysis of the corpus; Section 5 presents our experiments on the constructed corpus to check its reliability; Section 6 concludes the study.
We used a machine learning method to evaluate the constructed corpus. This method is based on the study of Tanaka-Ishii et al. [32], which used Support Vector Machines (SVM) to build a model that compares the readability of pairs of texts based on word frequency features.
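The pairwise idea described above can be illustrated with a small sketch. It is not the authors' implementation or that of Tanaka-Ishii et al. [32]. It assumes English toy texts invented for illustration, relative word frequencies as features, and a plain sub-gradient loop on the hinge loss standing in for an SVM library. All function names (`word_freq_vector`, `pair_features`, `train_linear_svm`, `harder`) are hypothetical.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy data: (text, difficulty level); a higher level means harder.
graded = [
    ("the cat sat on the mat", 1),
    ("the dog ran to the cat", 1),
    ("the committee evaluated the linguistic hypothesis", 2),
    ("the hypothesis concerning linguistic cognition was evaluated", 2),
]

# Vocabulary built from the toy corpus only (an assumption of this sketch).
vocab = sorted({tok for text, _ in graded for tok in text.lower().split()})


def word_freq_vector(text, vocab):
    """Relative frequency of each vocabulary word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[word] / total for word in vocab]


def pair_features(text_a, text_b, vocab):
    """Represent a pair of texts as the difference of their frequency vectors."""
    vec_a = word_freq_vector(text_a, vocab)
    vec_b = word_freq_vector(text_b, vocab)
    return [a - b for a, b in zip(vec_a, vec_b)]


def train_linear_svm(X, y, epochs=50, lr=0.1, lam=0.01):
    """Sub-gradient descent on the L2-regularised hinge loss (no bias term)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            for j in range(len(w)):
                # The hinge term contributes only when the margin is violated.
                hinge = yi * xi[j] if margin < 1 else 0.0
                w[j] -= lr * (lam * w[j] - hinge)
    return w


def harder(text_a, text_b, w, vocab):
    """True if the trained comparator scores text_a as harder than text_b."""
    x = pair_features(text_a, text_b, vocab)
    return sum(wj * xj for wj, xj in zip(w, x)) > 0


# Training pairs: every ordered pair of texts with different levels,
# labelled +1 if the first text is harder and -1 otherwise.
X, y = [], []
for (text_a, level_a), (text_b, level_b) in permutations(graded, 2):
    if level_a != level_b:
        X.append(pair_features(text_a, text_b, vocab))
        y.append(1 if level_a > level_b else -1)

w = train_linear_svm(X, y)

hard_text = "the linguistic hypothesis was evaluated"
easy_text = "the cat ran on the mat"
print(harder(hard_text, easy_text, w, vocab))
print(harder(easy_text, hard_text, w, vocab))
```

Because the model has no bias term and each pair is encoded as a difference vector, swapping the two texts flips the sign of the score, so the comparator gives consistent answers in both directions. In practice an established SVM library and a graded corpus would replace the hand-written trainer and the toy data.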
Summary
Reading is one of the fundamental skills through which people all over the world acquire knowledge. Dell’Orletta et al. [6] examined the corpus for readability features at both the text and the sentence levels. Their corpus was built from two sources: (1) a newspaper, La Repubblica; and (2) an easy-to-read newspaper, Due Parole.
The authors examined these texts to develop the first formula for Vietnamese readability assessment [8]. In recent studies on Vietnamese text readability, Luong et al. [10, 11, 12] and Diep et al. [13] analysed around 380 texts collected from school textbooks to study the effect of text length and some features specific to the Vietnamese language on text readability.
The article is organized as follows: Section 2 states the criteria for building the corpus; Section 3 presents the process of building a corpus for Vietnamese readability assessment, along with basic statistics and some experiments; Section 4 gives deeper statistics and analysis of the corpus; Section 5 presents our experiments on the constructed corpus to check its reliability; Section 6 concludes the study.