Research on Popular Words and Phrases Extraction of Network Base on PAT TREE

Baozhen Wu, Li Li, Yong Zhang, Tingting He, Long Chen

doi:10.1109/csse.2008.1210

Abstract

This paper presents a method for mining popular words and phrases from the Internet. We downloaded web pages from January 1, 2007 to June 30, 2007 from various information sources on the network. Candidate words are generated by full segmentation with a PAT tree, and the candidate set is then filtered in three successive passes: first a weight filter based on the vector space model, then a filter based on a language-rule model, and finally a garbage-cluster filter. A popular-word determination formula is then applied to the remaining candidates to extract the popular words and phrases, and trend curves are drawn for the extracted popular words. Experiments indicate that the method markedly speeds up computer-aided extraction of popular words and phrases from the network without reducing the accuracy of the extracted catchwords.
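
The overall shape of this pipeline (candidate generation followed by successive filters) can be illustrated with a minimal sketch. It is not the authors' implementation: a plain substring counter stands in for the PAT tree, and the function names `candidate_strings` and `apply_filters`, their parameters, and the example predicate are all hypothetical. The abstract does not give the weight formula, the language rules, the garbage clustering, or the determination formula, so those appear only as placeholder predicates.

```python
from collections import Counter


def candidate_strings(docs, min_len=2, max_len=4, min_freq=2):
    """Count every substring of length min_len..max_len across docs.

    Assumption: this brute-force counter stands in for the PAT tree,
    which would answer the same frequency queries more efficiently.
    """
    counts = Counter()
    for doc in docs:
        for i in range(len(doc)):
            for n in range(min_len, max_len + 1):
                if i + n <= len(doc):
                    counts[doc[i:i + n]] += 1
    # Keep only strings that repeat often enough to be candidates.
    return {s: c for s, c in counts.items() if c >= min_freq}


def apply_filters(candidates, filters):
    """Apply each keep(string, count) predicate in turn.

    Assumption: each predicate is a placeholder for one of the three
    filtering passes named in the abstract.
    """
    for keep in filters:
        candidates = {s: c for s, c in candidates.items() if keep(s, c)}
    return candidates


if __name__ == "__main__":
    docs = ["abab", "ab"]
    cands = candidate_strings(docs, min_len=2, max_len=2, min_freq=2)
    # Hypothetical stand-in for a weight filter: require at least 3 hits.
    kept = apply_filters(cands, [lambda s, c: c >= 3])
    print(kept)
```

Passing the filters as a list keeps the three passes independent, so each one could be replaced by a real implementation without touching the others.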

Full Text