Improved algorithm of thematic term extraction based on increment term-set frequency from Chinese document

Xinglin Liu

doi:10.3724/sp.j.1087.2013.02546

Abstract

The thematic term extraction algorithm based on incremental term-set frequency cannot extract compound words. To solve this problem, this paper adds a text preprocessing step, compound-word recognition, to the original algorithm. Compound-word recognition is based on part-of-speech detection and a directed graph of word co-occurrence, and it corrects the word segmentation results. When the thematic term candidate set is generated, the position of each word is examined to determine its weight. The weights of all occurrences of the same word are then accumulated, and the candidate set is ordered by total weight from high to low. Each time the algorithm takes a term from the candidate set, it calculates the increment in term-set frequency. If the increment is less than a given threshold, the algorithm stops; otherwise, the candidate is added to the thematic term set. The experimental results show that the algorithm performs well: the thematic terms it extracts reflect the main content of an article more aptly, and satisfaction with the thematic terms is 5% higher than with the original algorithm.
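The candidate ranking and the stopping rule described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. The position weights, the function names, and in particular the definition of term-set frequency (taken here as the fraction of sentences containing at least one selected term) are assumptions made for illustration, since the abstract does not specify them. Compound-word recognition is not covered; the input is assumed to be already segmented and corrected.

```python
from collections import defaultdict

# Hypothetical position weights; the abstract does not give the actual values.
POSITION_WEIGHTS = {"title": 3.0, "first_paragraph": 2.0, "body": 1.0}


def rank_candidates(tagged_words):
    """Accumulate position weights per word and return words from high to low.

    tagged_words is a list of (word, position) pairs, one per occurrence.
    """
    totals = defaultdict(float)
    for word, position in tagged_words:
        totals[word] += POSITION_WEIGHTS[position]
    # Descending total weight; ties are broken alphabetically.
    return sorted(totals, key=lambda w: (-totals[w], w))


def term_set_frequency(terms, sentences):
    """Assumed definition: fraction of sentences containing any term in the set."""
    if not terms or not sentences:
        return 0.0
    covered = sum(1 for s in sentences if any(t in s for t in terms))
    return covered / len(sentences)


def extract_thematic_terms(candidates, sentences, threshold):
    """Add candidates in rank order until the frequency increment is too small."""
    selected = []
    current = 0.0
    for term in candidates:
        new = term_set_frequency(selected + [term], sentences)
        if new - current < threshold:
            break  # increment below the threshold: the algorithm stops
        selected.append(term)
        current = new
    return selected
```

For example, with each sentence represented as a set of words:

```python
tagged = [("extraction", "title"), ("term", "title"), ("term", "body"),
          ("graph", "body"), ("extraction", "body"), ("term", "body")]
sentences = [{"term", "graph"}, {"term"}, {"extraction"}, {"other"}]
candidates = rank_candidates(tagged)
thematic_terms = extract_thematic_terms(candidates, sentences, threshold=0.2)
```

Under these assumptions, a candidate that appears only in sentences already covered by earlier terms contributes no increment, so the loop stops there.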
