Abstract

Web page keyword extraction is widely used in web text classification, text clustering, and information retrieval. However, keyword extraction for Chinese web pages still needs improvement and wider application, especially in the medical field. This paper proposes an improved TF-IDF algorithm, WF-TF-IDF, to extract keywords from Chinese medical web pages. The WF-TF-IDF algorithm considers three factors: word frequency in the title, word frequency in the description, and the distribution of words across categories in the corpus. We first preprocess the data, which includes web page denoising, regular expression processing, Chinese word segmentation, synonym replacement, and stop word filtering. We then extract keywords from the preprocessed text and filter out meaningless words according to their part of speech. The experimental results show that the WF-TF-IDF algorithm improves the precision rate and recall rate by about 7% compared to the traditional TF-IDF algorithm.
