A Chinese Text Similarity Calculation Algorithm Based on DF_LDA

Chao Zhang, Li Chen, Qiong Li

doi:10.2991/978-94-6239-148-2_61

Abstract

To reduce the computational complexity of Chinese text similarity calculation and to improve text clustering accuracy, this paper proposes a new text similarity algorithm based on DF_LDA. First, the document frequency (DF) method is used for feature extraction; next, latent Dirichlet allocation (LDA) is used to construct a topic model of the text; finally, the resulting DF_LDA model is used to calculate text similarity. Because it takes both the semantic information and the word frequency information of the text into account, the new method can improve clustering precision. In addition, DF_LDA reduces the dimensionality of the text feature vectors in two stages, first through DF feature selection and then through LDA topic modeling, which shortens similarity calculation time and speeds up text clustering. Experiments on the TanCorp-12-Txt and FuDanCorp datasets show that the proposed method reduces modeling time and improves text clustering accuracy.
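
The three steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes documents are already segmented into word lists, uses a plain collapsed Gibbs sampler for LDA, and compares documents by cosine similarity of their topic distributions, since the abstract does not state which similarity measure the paper uses. The function names (`df_filter`, `lda_gibbs`, `cosine`) and all parameter values (`min_df`, `n_topics`, `alpha`, `beta`, `n_iter`) are hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def df_filter(docs, min_df=2, max_df_ratio=1.0):
    """Step 1 (DF): keep only terms whose document frequency is within bounds."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = {t for t, c in df.items() if c >= min_df and c / n_docs <= max_df_ratio}
    filtered = [[t for t in doc if t in vocab] for doc in docs]
    return filtered, sorted(vocab)


def lda_gibbs(docs, vocab, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Step 2 (LDA): collapsed Gibbs sampling; returns one topic distribution per document."""
    rng = random.Random(seed)
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = word_id[w]
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wi] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wi] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wi] += 1
                topic_total[z] += 1
    # Smoothed document-topic distributions (each row sums to 1).
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(n_topics)])
    return theta


def cosine(p, q):
    """Step 3: similarity of two topic vectors (cosine is an assumed choice)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Toy corpus, assumed to be pre-segmented into words.
    corpus = [
        ["足球", "比赛", "球队", "进球"],
        ["球队", "比赛", "冠军", "足球"],
        ["股票", "市场", "投资", "基金"],
        ["投资", "股票", "市场", "收益"],
    ]
    filtered, vocab = df_filter(corpus, min_df=2)
    theta = lda_gibbs(filtered, vocab, n_topics=2)
    print(cosine(theta[0], theta[1]), cosine(theta[0], theta[2]))
```

In this sketch the DF step shrinks the vocabulary and the LDA step then maps each document to an `n_topics`-dimensional vector, so the similarity computation works on short topic vectors rather than on full term vectors.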

Full Text