Web Content Extraction Using Clustering with Web Structure

Xiaotao Huang, Liqun Huang, Yuhua Li, Zhizhao Zhang, Ling Kang, Fen Wang, Yan Gao

doi:10.1007/978-3-319-59072-1_12

Abstract

Web content extraction is an essential part of data preprocessing in web information systems. This paper proposes an algorithm for web content extraction based on clustering by web structure. The process is divided into two steps. In the first step, web pages collected from different websites are clustered, using a web page similarity measure based on dynamic programming over node weights. First, each web page is parsed into a DOM tree; second, a weight is assigned to every node according to the node's position, the number of nodes at the same depth, and the depth of the DOM tree; third, the similarity of two pages is calculated according to the given formula. When the first step is finished, web pages with similar structure have been grouped into the same set. In the second step, pages in the same set are compared and the parts they share are removed; what remains is the web content. Experiments show that the proposed algorithm is effective and accurate.
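The two steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not reproduce the paper's weight formula, similarity formula, or clustering procedure, so everything here is an assumption made for illustration: the weight `1 / (1 + depth)` is a stand-in that uses only node depth, the similarity is a weighted variant of the classic Simple Tree Matching dynamic program, the clustering is a greedy pass with a hypothetical threshold of 0.8, and all function and class names are invented for this example.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, depth, text=""):
        self.tag, self.depth, self.text = tag, depth, text
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a simple DOM-like tree; text runs become '#text' leaves."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root", 0)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, len(self.stack))
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            leaf = Node("#text", len(self.stack), data.strip())
            self.stack[-1].children.append(leaf)


def parse(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def weight(node):
    # ASSUMPTION: stand-in weight, not the paper's formula.
    return 1.0 / (1 + node.depth)


def match(a, b):
    """Weighted Simple Tree Matching: best order-preserving child alignment."""
    if a.tag != b.tag:
        return 0.0
    m, n = len(a.children), len(b.children)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = dp[i - 1][j - 1] + match(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], paired)
    return weight(a) + dp[m][n]


def total_weight(node):
    return weight(node) + sum(total_weight(c) for c in node.children)


def similarity(a, b):
    # Normalized to [0, 1]; structurally identical trees score 1.
    return 2 * match(a, b) / (total_weight(a) + total_weight(b))


def cluster(trees, threshold=0.8):
    """Step 1 (greedy sketch): join the first cluster whose first page is similar."""
    clusters = []
    for i, tree in enumerate(trees):
        for group in clusters:
            if similarity(trees[group[0]], tree) >= threshold:
                group.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def text_blocks(node):
    if node.tag == "#text":
        return [node.text]
    return [t for child in node.children for t in text_blocks(child)]


def extract_content(trees):
    """Step 2: drop text blocks that occur on more than one page of a cluster."""
    pages = [text_blocks(t) for t in trees]
    seen = Counter(block for page in pages for block in set(page))
    return [[block for block in page if seen[block] == 1] for page in pages]
```

In this sketch, pages that share a template (same navigation and footer markup around different article text) would land in one cluster, and `extract_content` would keep only the text unique to each page. Comparing exact text blocks is a deliberate simplification of "removing the same parts of pages"; a closer rendering would compare aligned subtrees.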
