Abstract

As online publication continues to expand dramatically, state libraries urgently need effective tools to organize and archive the large number of government documents published online. Automatic text categorization techniques can classify documents approximately, given a sufficient number of labeled training examples. However, obtaining training labels is expensive because it requires substantial manual labor. We present a real-world online government information preservation project (PEP) in the State of Illinois, and a semi-supervised machine learning approach, an Expectation-Maximization (EM) algorithm-based text classifier, which is applied to automatically assign subject headings to documents harvested in the PEP project. The EM classifier makes use of easily obtained unlabeled documents and thus reduces the demand for labeled training examples. This paper describes both the context and the procedure of this application. Experimental results are reported, and alternative approaches are also discussed.
