Abstract

The paper deals with the text classification problem where labeled training samples are very limited while unlabeled data are readily available in large quantities. The paper proposes an efficient classification algorithm that incorporates a weighted k-means clustering scheme into an Expectation Maximization (EM) process. It aims to balance predictive values between labeled and unlabeled training data and improve classification accuracy. Since the algorithm is based on a fast clustering method, it can be applied to classify documents in large datasets. Preliminary experiments with several text classification collections show that the proper use of unlabeled data built in this proposed text classification algorithm could significantly improve classification accuracy.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call