Abstract

Selecting relevant features is hard in unsupervised text clustering because there are no class labels to guide the search. This paper proposes a new mixture-model method for unsupervised text clustering, the multinomial mixture model with feature selection (M3FS). M3FS introduces component-dependent “feature saliency” into the mixture model: a feature is considered relevant to a given mixture component if its saliency value exceeds a predefined threshold. Feature selection is thus treated as a parameter estimation problem, and the Expectation–Maximization (EM) algorithm is used to estimate the model parameters. Experimental results on commonly used text datasets show that M3FS achieves good clustering performance and effective feature selection.
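To make the two ingredients named above concrete, the following is a minimal sketch, not the M3FS algorithm itself. It assumes a plain multinomial mixture fitted by EM, without the saliency parameters that M3FS adds, plus a separate thresholding step applied to a component-by-feature saliency matrix. The function names `fit_multinomial_mixture` and `select_features`, the smoothing constant `alpha`, and the default threshold of 0.5 are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_multinomial_mixture(X, n_components, n_iter=50, alpha=1e-2, seed=0):
    """EM for a plain multinomial mixture (assumption: no saliency terms).

    X is an (n_docs, n_terms) array of non-negative term counts.
    Returns mixing weights (K,), term probabilities (K, D), and
    per-document responsibilities (N, K).
    """
    rng = np.random.default_rng(seed)
    n_docs, _ = X.shape
    # Start from random soft assignments of documents to components.
    resp = rng.dirichlet(np.ones(n_components), size=n_docs)
    for _ in range(n_iter):
        # M-step: mixing weights and smoothed term probabilities.
        weights = resp.sum(axis=0) / n_docs
        counts = resp.T @ X + alpha  # alpha keeps every probability positive
        theta = counts / counts.sum(axis=1, keepdims=True)
        # E-step: unnormalized log posterior per component, then a
        # numerically stable normalization over components.
        log_post = np.log(weights + 1e-300) + X @ np.log(theta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    return weights, theta, resp


def select_features(saliency, threshold=0.5):
    """Mark feature d as relevant to component k when saliency[k, d] > threshold.

    `saliency` is a hypothetical (K, D) matrix of estimated saliency values;
    how M3FS estimates it is not specified in the abstract.
    """
    return saliency > threshold
```

In the method described above, the saliency values would be estimated jointly with the other mixture parameters inside EM; here they are treated as a given input so that the thresholding rule can be shown in isolation.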
