Boilerplate Removal and Content Extraction from Dynamic Web Pages

Pan Ei San

doi:10.5121/ijcsea.2014.4603

Abstract

Web pages contain not only main content but also other elements such as navigation panels, advertisements, and links to related documents. To ensure high-quality web page content, a good boilerplate removal algorithm is needed to extract only the relevant content from a web page. The main textual content is embedded in the HTML source code that makes up the page. The goal of content extraction, or boilerplate detection, is to separate the main content from navigation chrome, advertising blocks, and copyright notices in web pages. The proposed system removes boilerplate and extracts the main content in two phases: a Feature Extraction phase and a Clustering phase. It classifies each part of an HTML web page as noise or content. The Content Extraction algorithm is designed to achieve high performance without parsing DOM trees. Observation of HTML source shows that a single line may not contain a complete piece of information and that long texts are distributed across nearby lines, so the system uses the Line-Block concept to determine the distance between any two neighboring lines that contain text. Feature extraction then computes the text-to-tag ratio (TTR), the anchor text-to-text ratio (ATTR), and a new content feature, Title Keywords Density (TKD), to distinguish noise from content. After extracting these features, the system uses them as parameters in a threshold method to classify each block as content or non-content.
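The abstract names the features but does not give their formulas or threshold values, so the following Python sketch is only one plausible reading of the pipeline. It assumes TTR is text length divided by tag count, ATTR is anchor-text length divided by total text length, and TKD is the share of a block's words that also occur in the page title. The function names, the `max_distance` parameter, and all threshold values are hypothetical and not taken from the paper. In keeping with the abstract, it works on raw HTML lines with regular expressions rather than a DOM tree.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)
WORD_RE = re.compile(r"[a-z0-9]+")


def strip_tags(html):
    """Remove anything that looks like an HTML tag (no DOM parsing)."""
    return TAG_RE.sub("", html)


def line_blocks(html, max_distance=2):
    """Group text-bearing lines whose line-index distance is small.

    max_distance is an assumed parameter, not a value from the paper.
    """
    blocks, current, last = [], [], None
    for i, line in enumerate(html.splitlines()):
        if not strip_tags(line).strip():
            continue  # tag-only or empty lines carry no text
        if last is not None and i - last > max_distance:
            blocks.append("\n".join(current))
            current = []
        current.append(line)
        last = i
    if current:
        blocks.append("\n".join(current))
    return blocks


def block_features(block_html, title):
    """Compute assumed versions of TTR, ATTR, and TKD for one block."""
    text = strip_tags(block_html).strip()
    tag_count = len(TAG_RE.findall(block_html))
    anchor_text = "".join(
        strip_tags(inner) for inner in ANCHOR_RE.findall(block_html)
    ).strip()
    words = WORD_RE.findall(text.lower())
    # No stop-word filtering here; a real system would likely add it.
    title_words = set(WORD_RE.findall(title.lower()))
    return {
        "ttr": len(text) / max(tag_count, 1),
        "attr": len(anchor_text) / len(text) if text else 0.0,
        "tkd": sum(w in title_words for w in words) / len(words) if words else 0.0,
    }


def is_content(features, min_ttr=20.0, max_attr=0.5, min_tkd=0.05):
    """Threshold rule with illustrative, untuned cut-offs."""
    if features["attr"] > max_attr:
        return False  # mostly link text: likely navigation
    return features["ttr"] >= min_ttr or features["tkd"] >= min_tkd
```

Under these assumptions, a navigation list made up entirely of links has an ATTR of 1.0 and is rejected, while a long paragraph with few tags passes on its TTR. Real thresholds would have to be tuned on labeled pages.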

Journal: International Journal of Computer Science, Engineering and Applications
Publication Date: Dec 31, 2014
Citations: 3
