A Method of Readability Assessment for Web Documents Using Text Features and HTML Structures

Takahiro Yamasaki, Kin‐Ichiroh Tokiwa

doi:10.1002/ecj.11565

Abstract

SUMMARY: This paper describes a method of readability assessment for Web documents. Readability is the ease with which text can be read and understood. We hypothesize that readability is determined by whether a reader can easily grasp the structure of the text, and that the impression and complexity of the text are significant factors. We extract features of impression and complexity from the plain text and from additional data, such as HTML tags. To compare the effect of the extracted features, we estimate readability rank by machine learning. We conduct fivefold cross validation for each domain and calculate the root mean squared error between the actual rank and the estimated rank. The cross validation experiments confirm that the performance of our method is high, showing the effectiveness of extracting features of impression and complexity for readability assessment.
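
The evaluation protocol named in the summary, fivefold cross validation scored by root mean squared error (RMSE) between actual and estimated rank, can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the abstract does not say which learner was used, so `fit_mean_baseline` is a hypothetical placeholder. Pooling the held-out predictions before computing one RMSE, rather than averaging per-fold errors, is likewise an assumption, as are all function names and the contiguous, unshuffled fold assignment.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def k_fold_indices(n, k=5):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_rmse(features, ranks, fit, k=5):
    """Train on k-1 folds, predict the held-out fold, and score all
    held-out predictions with a single pooled RMSE (an assumption)."""
    n = len(ranks)
    predictions = [0.0] * n
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        model = fit([features[i] for i in train_idx],
                    [ranks[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = model(features[i])
    return rmse(ranks, predictions)


def fit_mean_baseline(train_features, train_ranks):
    """Hypothetical placeholder learner: ignores the features and
    always predicts the mean rank seen in training."""
    mean_rank = sum(train_ranks) / len(train_ranks)
    return lambda _features: mean_rank
```

To apply this per domain, as the summary describes, `cross_validated_rmse` would be called once for each domain's documents, with `features` holding the extracted impression and complexity features and `fit` replaced by the learner under comparison. A lower RMSE means the estimated ranks lie closer to the actual ranks.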

Full Text