Abstract
With the popularity of social networks, short text clustering has become an increasingly important and widely used task. It is challenging because short texts from social networks contain irregular words, substantial noise, and sparse features. We propose Short Text Clustering enhanced by a Semantic Matching Model (STCSMM). STCSMM transfers knowledge from a labeled text similarity dataset to short text clustering through a semantic matching model, thereby improving clustering performance. First, we train a semantic matching network on the dataset of the text similarity task; the network consists of a feature extraction layer and a vector distance calculation layer. Then, we use the learned feature extraction layer to extract short text features, and we use the vector distance calculation layer to replace the distance metrics commonly used in the traditional K-means algorithm, such as cosine distance and Euclidean distance. Finally, K-means is run on the extracted text features with the vector distance calculation layer as its distance measure. On a microblog text clustering dataset, this improved K-means clustering (STCSMM) outperforms some existing methods, such as K-means clustering with LDA, LSI, or average word embedding feature vectors.
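The final clustering step can be illustrated with a minimal sketch of K-means whose assignment step takes a pluggable distance function. This is not the authors' implementation. The abstract does not specify the form of the trained vector distance calculation layer or how centroids are updated, so the sketch assumes mean-based centroid updates and uses a hypothetical stand-in, `placeholder_learned_distance`, where the trained layer would go. The names `kmeans_with_distance` and `placeholder_learned_distance` are illustrative only.

```python
import numpy as np


def kmeans_with_distance(features, k, distance_fn, n_iter=50, seed=0):
    """K-means whose assignment step uses a pluggable distance function.

    distance_fn(features, centroids) must return an (n, k) matrix of
    distances; in STCSMM this role is played by the trained vector
    distance calculation layer (not reproduced here).
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    idx = rng.choice(len(features), size=k, replace=False)
    centroids = features[idx].copy()
    labels = np.full(len(features), -1)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = distance_fn(features, centroids).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so stop early
        labels = new_labels
        # Update step (assumption): centroid = mean of its members.
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids


def placeholder_learned_distance(x, c):
    """Hypothetical stand-in for the trained distance layer.

    Computes a plain L1 distance between every point and every centroid.
    A real system would call the trained network here instead.
    """
    return np.abs(x[:, None, :] - c[None, :, :]).sum(axis=-1)


# Toy "text features": two well-separated groups of 2-D vectors.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans_with_distance(X, 2, placeholder_learned_distance)
print(labels)
```

Passing the distance as a function keeps the clustering loop unchanged when cosine or Euclidean distance is swapped for a learned measure, which is the substitution the abstract describes.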