Abstract

AbstractA potentially useful feature of information retrieval systems for students is the ability to identify documents that not only are relevant to the query but also match the student's reading level. Manually obtaining an estimate of reading difficulty for each document is not feasible for very large collections, so we require an automated technique. Traditional readability measures, such as the widely used Flesch‐Kincaid measure, are simple to apply but perform poorly on Web pages and other nontraditional documents. This work focuses on building a broadly applicable statistical model of text for different reading levels that works for a wide range of documents. To do this, we recast the well‐studied problem of readability in terms of text categorization and use straightforward techniques from statistical language modeling. We show that with a modified form of text categorization, it is possible to build generally applicable classifiers with relatively little training data. We apply this method to the problem of classifying Web pages according to their reading difficulty level and show that by using a mixture model to interpolate evidence of a word's frequency across grades, it is possible to build a classifier that achieves an average root mean squared error of between one and two grade levels for 9 of 12 grades. Such classifiers have very efficient implementations and can be applied in many different scenarios. The models can be varied to focus on smaller or larger grade ranges or easily retrained for a variety of tasks or populations.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.