Discovering Informative Contents of Web Pages

Qifeng Fan, Lifu Huang, Lian’En Huang, Chunwei Yan

doi:10.1007/978-3-319-08010-9_20

Abstract

The World Wide Web has become a huge information repository. However, besides informative contents, Web pages also contain redundant contents, which are considered harmful for Web mining and searching systems. In this paper, we propose a new approach to discover informative contents from a set of Web pages within a single Web site. Our method works as follows. First, we propose a newly designed Site Style Tree (SST) to capture the common presentation styles and the actual contents of the pages in the given Web site. The tree structure, which differs from the one formerly proposed, is built by aligning the pages of the site. Then, for each node of the SST, informative contents are discovered using an entropy-based threshold method. The proposed approach is evaluated on two mining tasks, Web page clustering and classification. The experimental results show a significant improvement over previous template detection approaches.

Keywords: Template Detection, Information Extraction, Entropy
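The abstract does not give the entropy formula or the threshold, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that each aligned node is identified by a path string, that the node's text has been collected from every page of the site, and that a node is treated as informative when the normalized Shannon entropy of its contents across pages reaches a cutoff. The names `normalized_entropy`, `informative_nodes`, the example paths, and the `threshold=0.5` default are all hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of the observed values, scaled to [0, 1].

    0 means every page shows the same content at this node (template-like);
    1 means every page shows different content (likely informative).
    """
    n = len(values)
    if n <= 1:
        return 0.0
    counts = Counter(values)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(n)  # log2(n) is the maximum entropy for n pages


def informative_nodes(node_contents, threshold=0.5):
    """Return the node paths whose normalized entropy meets the threshold.

    node_contents maps a (hypothetical) node path to the list of text
    contents found at that node, one entry per page of the site.
    """
    return {
        node
        for node, contents in node_contents.items()
        if normalized_entropy(contents) >= threshold
    }


# Toy data: four pages of one site, three aligned nodes (assumed layout).
pages = {
    "html/body/div.nav": ["Home | About"] * 4,
    "html/body/div.main": ["Story A", "Story B", "Story C", "Story D"],
    "html/body/div.footer": ["(c) Example"] * 3 + ["(c) Example 2014"],
}

print(informative_nodes(pages))  # {'html/body/div.main'}
```

In this toy data the navigation bar is identical on every page (entropy 0), the footer varies on one page only (normalized entropy about 0.41), and the main block differs on every page (normalized entropy 1), so only the main block passes the assumed cutoff.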

