Abstract

The rapid advance of Internet technology has brought two challenges: the volume of information has exploded, and new information appears rapidly. Clustering can help users analyze information automatically, but traditional clustering algorithms are suitable only for small-scale, stable text collections. To address this problem, this paper proposes LDVC, a novel clustering algorithm based on vector compression designed for large-scale text collections, and its incremental version, I-LDVC. LDVC selects related features to compress the feature set, and borrows the iterative training idea of the self-organizing map (SOM) to optimize the selection approach. When new texts appear, I-LDVC selects small samples from the original texts to adjust the neuron model and perform incremental clustering. To prevent overfitting to the newly added texts, I-LDVC adjusts the weights of the samples during training. Experimental results demonstrate that LDVC achieves better performance and lower time complexity on large-scale text collections, and that I-LDVC clusters unstable text collections well. DOI: http://dx.doi.org/10.5755/j01.itc.45.2.8666
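
The abstract does not give the details of LDVC, so the following is only a minimal sketch of the two general ideas it names: compressing document vectors by keeping a subset of features, then training SOM-style neurons iteratively on the compressed vectors. The variance-based selection rule, the one-dimensional neuron chain, the evenly spaced initialization, and all function names and parameters are assumptions made for illustration, not the authors' method.

```python
def top_variance_features(vectors, k):
    """Hypothetical selection rule: indices of the k highest-variance features."""
    n = len(vectors)
    scores = []
    for j in range(len(vectors[0])):
        col = [v[j] for v in vectors]
        mean = sum(col) / n
        scores.append((sum((x - mean) ** 2 for x in col) / n, j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])


def compress(vectors, keep):
    """Project each vector onto the feature indices in `keep`."""
    return [[v[j] for j in keep] for v in vectors]


def best_matching_unit(neurons, x):
    """Index of the neuron closest to x (squared Euclidean distance)."""
    return min(range(len(neurons)),
               key=lambda i: sum((w - a) ** 2 for w, a in zip(neurons[i], x)))


def train_som(vectors, n_neurons, epochs=20, lr0=0.5):
    """Train a 1-D chain of neurons with a decaying learning rate.

    Neurons start at evenly spaced input vectors (an assumption). During the
    first half of training, the winner's chain neighbours are also pulled
    toward the input at half strength; afterwards only the winner moves.
    """
    last = len(vectors) - 1
    neurons = [list(vectors[round(i * last / max(n_neurons - 1, 1))])
               for i in range(n_neurons)]
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)
        radius = 1 if epoch < epochs // 2 else 0
        for x in vectors:
            bmu = best_matching_unit(neurons, x)
            for i in range(n_neurons):
                d = abs(i - bmu)
                if d <= radius:
                    h = 1.0 if d == 0 else 0.5  # neighbourhood strength
                    neurons[i] = [w + lr * h * (a - w)
                                  for w, a in zip(neurons[i], x)]
    return neurons


# Toy term-weight vectors: two groups of texts; the third feature is uninformative.
docs = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.5], [1.0, 0.1, 0.5],
        [0.0, 1.0, 0.5], [0.1, 0.9, 0.5], [0.0, 0.9, 0.5]]
keep = top_variance_features(docs, 2)
small = compress(docs, keep)
neurons = train_som(small, n_neurons=2)
labels = [best_matching_unit(neurons, x) for x in small]
```

Here each text is assigned to the cluster of its best-matching neuron. An incremental variant in the spirit of I-LDVC would presumably continue training from the existing `neurons` on new texts mixed with a weighted sample of old ones, but the abstract does not specify how the samples or weights are chosen.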
