Learning to integrate unlabeled data in text classification

Eric P Jiang

doi:10.1109/iccsit.2010.5564473

Abstract

This paper addresses text classification when labeled training samples are scarce but unlabeled data are readily available in large quantities. It proposes an efficient classification algorithm that incorporates a weighted k-means clustering scheme into an Expectation Maximization (EM) process. The algorithm aims to balance the predictive value of labeled and unlabeled training data and thereby improve classification accuracy. Because it is based on a fast clustering method, it can be applied to classify documents in large datasets. Preliminary experiments with several text classification collections show that the proper use of unlabeled data in the proposed algorithm can significantly improve classification accuracy.
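To make the general idea concrete, the following is a minimal sketch of how a weighted k-means step might sit inside a hard-assignment EM loop. It is not the paper's algorithm: the abstract does not specify the weighting scheme, so the function name `weighted_kmeans_em`, the `unlabeled_weight` parameter, the Euclidean distance, and the toy two-dimensional vectors are all assumptions made for illustration.

```python
import math

def centroid(points, weights):
    """Weighted mean of equal-length vectors."""
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[i] for p, w in zip(points, weights)) / total
            for i in range(dim)]

def nearest(x, centroids):
    """Class whose centroid is closest to x (Euclidean distance)."""
    return min(centroids, key=lambda c: math.dist(x, centroids[c]))

def weighted_kmeans_em(labeled, unlabeled, unlabeled_weight=0.5, iters=10):
    """Hypothetical sketch: labeled documents count with weight 1.0,
    unlabeled documents with `unlabeled_weight` (an assumed parameter)."""
    classes = sorted({y for _, y in labeled})
    # Initialise one centroid per class from the labeled documents only.
    centroids = {}
    for c in classes:
        pts = [x for x, y in labeled if y == c]
        centroids[c] = centroid(pts, [1.0] * len(pts))
    assignments = []
    for _ in range(iters):
        # E-step (hard): give each unlabeled document its nearest class.
        new_assignments = [nearest(x, centroids) for x in unlabeled]
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments
        # M-step: recompute centroids, down-weighting unlabeled documents.
        for c in classes:
            pts = [x for x, y in labeled if y == c]
            wts = [1.0] * len(pts)
            for x, a in zip(unlabeled, assignments):
                if a == c:
                    pts.append(x)
                    wts.append(unlabeled_weight)
            centroids[c] = centroid(pts, wts)
    return centroids, assignments

# Toy data: one labeled document per class, four unlabeled documents.
labeled = [([0.0, 0.0], "a"), ([10.0, 10.0], "b")]
unlabeled = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0], [10.0, 9.0]]
centroids, assignments = weighted_kmeans_em(labeled, unlabeled)
```

In this sketch a smaller `unlabeled_weight` keeps the centroids closer to the labeled documents, while a larger one lets the unlabeled documents pull them further; how the paper actually balances the two is not stated in the abstract.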

Full Text