Abstract

This paper constructs a latent semantic text model using a genetic algorithm (GA) for web clustering. The main difficulty in applying GA to text clustering is that the feature space has thousands or even tens of thousands of dimensions. Latent semantic indexing (LSI) is a successful technique that attempts to uncover the latent semantic structure in textual data; it also reduces this large space to a smaller one and provides a robust space for clustering. GA belongs to the class of search techniques that efficiently evolve an optimal solution to a problem. Evolved in the reduced latent semantic space, GA can improve both clustering accuracy and speed, which makes it well suited to real-time clustering. We use the SSTRESS criterion to analyze the dissimilarity between the original term-by-document corpus matrix and its approximate decompositions at different ranks, and relate this dissimilarity to the performance of our algorithm in the reduced space. The superiority of GA applied in the LSI model over K-means and conventional GA in the vector space model (VSM) is demonstrated by good clustering results on Reuters text.
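
As a rough illustration of the evolutionary search described above, the following sketch clusters documents that are assumed to have already been projected into a low-rank LSI space. The two-dimensional coordinates in `docs`, the centroid-based chromosome, truncation selection, uniform crossover, Gaussian mutation, and every parameter value are illustrative assumptions; the abstract does not specify the encoding or operators the authors used.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]

def cost(points, centroids):
    """Sum of squared distances to the nearest centroid (lower is fitter)."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def ga_cluster(points, k, pop_size=20, generations=60, mutation_scale=0.1, seed=0):
    rng = random.Random(seed)
    # Each chromosome is a list of k centroids, seeded from random data points.
    population = [[list(p) for p in rng.sample(points, k)] for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: the fitter half survives unchanged (elitism).
        population.sort(key=lambda ch: cost(points, ch))
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            # Uniform crossover: take each centroid from one parent or the other.
            child = [list(rng.choice(pair)) for pair in zip(mother, father)]
            # Gaussian mutation perturbs every coordinate slightly.
            child = [[x + rng.gauss(0, mutation_scale) for x in c] for c in child]
            children.append(child)
        population = survivors + children
    best = min(population, key=lambda ch: cost(points, ch))
    return best, assign(points, best)

# Hypothetical documents in a 2-D reduced space, forming two topical groups.
docs = [(0.9, 0.1), (1.0, 0.0), (0.8, 0.2),
        (0.1, 0.9), (0.0, 1.0), (0.2, 0.8)]
centroids, labels = ga_cluster(docs, k=2)
```

Because each chromosome holds only `k` short centroid vectors, the cost of evaluating fitness scales with the reduced rank rather than with the original vocabulary size, which is the practical motivation for running the search after dimensionality reduction.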
