Incremental clustering in short text streams based on BM25

Lixin Xu, Lei Yang, Guang Chen

doi:10.1109/ccis.2014.7175694

Abstract

Short texts contain few keywords and have sparse features, which gives rise to the similarity drift problem. Traditional clustering algorithms are usually ineffective and wasteful of resources when applied to short text streams. To overcome these problems, this paper proposes an incremental clustering algorithm for short text streams based on BM25. The approach uses BM25 to extract the keywords of each cluster and their weights, and applies these extracted parameters to the similarity calculation. Theoretical analysis and experiments show that, compared with traditional clustering algorithms, the proposed incremental clustering algorithm addresses the similarity drift problem well and achieves satisfactory accuracy and performance on short text stream clustering.
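
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea of BM25-driven incremental clustering, not the authors' method. It assumes that each cluster is treated as a single BM25 "document" made of the term counts of its members, that IDF is computed over clusters, and that a new cluster is opened when the best score falls below a cutoff. The class name `BM25StreamClusterer`, the `threshold` value, the whitespace tokenizer, and the `k1` and `b` defaults are all illustrative assumptions.

```python
import math
from collections import Counter


class BM25StreamClusterer:
    """Toy incremental clusterer: each cluster is scored as one BM25 'document'.

    Hypothetical illustration only; names and defaults are assumptions.
    """

    def __init__(self, threshold=1.0, k1=1.2, b=0.75):
        self.threshold = threshold  # assumed cutoff below which a new cluster is opened
        self.k1 = k1                # BM25 term-frequency saturation parameter
        self.b = b                  # BM25 length-normalization parameter
        self.clusters = []          # one Counter of term frequencies per cluster

    def _idf(self, term):
        # Smoothed BM25 IDF computed over clusters; always positive.
        n = len(self.clusters)
        df = sum(1 for c in self.clusters if term in c)
        return math.log(1.0 + (n - df + 0.5) / (df + 0.5))

    def _score(self, tokens, cluster, avg_len):
        # BM25 score of the incoming text (the query) against one cluster.
        length = sum(cluster.values())
        score = 0.0
        for term in set(tokens):
            tf = cluster.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1.0 - self.b + self.b * length / avg_len)
            score += self._idf(term) * tf * (self.k1 + 1.0) / norm
        return score

    def add(self, text):
        """Assign one incoming text to a cluster and return the cluster index."""
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        best_id, best_score = None, 0.0
        if self.clusters:
            avg_len = sum(sum(c.values()) for c in self.clusters) / len(self.clusters)
            for i, cluster in enumerate(self.clusters):
                s = self._score(tokens, cluster, avg_len)
                if s > best_score:
                    best_id, best_score = i, s
        if best_id is None or best_score < self.threshold:
            # No cluster is similar enough: start a new one.
            self.clusters.append(Counter(tokens))
            return len(self.clusters) - 1
        # Fold the text into the winning cluster so its term weights evolve.
        self.clusters[best_id].update(tokens)
        return best_id
```

Each call to `add` processes one text and updates only the matched cluster's term counts, so earlier texts never need to be re-clustered. Because the cluster statistics are refreshed on every assignment, the term weights follow the stream as it evolves.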

Full Text