Abstract

With the rapid development of computer technology and network communication, the volume of short text data has grown enormously. Classifying short text snippets is challenging because they carry little semantic information and are highly sparse. In this paper, we propose an improved short text classification method based on the Latent Dirichlet Allocation (LDA) topic model and the K-Nearest Neighbor (KNN) algorithm. The generated probabilistic topics both make the texts more semantically focused and reduce their sparseness. In addition, we present a novel topic similarity measure that uses the topic-word matrix and the relationship between the discriminative terms of two short texts. A short text dataset for experimental validation is constructed by crawling posts from the Sina News website. Extensive comparative experiments show the effectiveness of the proposed method.
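
As a rough illustration of the KNN step over topic representations, the sketch below classifies a short text from its document-topic vector. It is not the similarity measure proposed in the paper: it assumes the topic vectors have already been inferred by an LDA model and substitutes plain cosine similarity for the paper's topic-word-matrix-based measure. The names `cosine`, `knn_classify`, the toy vectors, and the labels are all hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against zero vectors, which have no direction.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(query, train, k=3):
    """Majority vote among the k training texts most similar to `query`.

    `train` is a list of (topic_vector, label) pairs. The topic vectors
    are assumed to come from an LDA model fitted elsewhere.
    """
    ranked = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy document-topic distributions over three topics (made-up values).
train = [
    ([0.8, 0.1, 0.1], "sports"),
    ([0.7, 0.2, 0.1], "sports"),
    ([0.1, 0.8, 0.1], "finance"),
    ([0.2, 0.7, 0.1], "finance"),
    ([0.1, 0.1, 0.8], "tech"),
]

predicted = knn_classify([0.75, 0.15, 0.10], train, k=3)
print(predicted)
```

Working in topic space rather than raw word space is what addresses sparseness here: two short texts that share no words can still have similar topic vectors.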
