To extract informative content from online web pages by using hybrid approach

Madhura R Kaddu, R B Kulkarni

doi:10.1109/iceeot.2016.7754831

Abstract

Web pages contain a vast amount of data and are a rich source of information available to everyone through the Internet. However, a web page combines informative data with noisy data such as navigational links, advertisements, menus, and footers, which increases the complexity of extracting the main content. Hand-crafted rules built from string manipulation functions have been used to access the main content, but preparing these rules is a difficult task for users. We therefore present a hybrid approach for extracting the main content from web pages, based on a combination of automatic extraction and hand-crafted rule techniques. The proposed work focuses mainly on the automatic extraction technique, in which rules are created automatically instead of being hand-crafted manually. The web page is converted into a DOM tree and features are extracted from it. These features are fed to a machine learning method such as decision tree classification, which produces dynamic rules. The informative content is then extracted from the web pages using these rules. Furthermore, the rules created by the automatic extraction technique are reused as hand-crafted rules to extract content from web pages without running the machine learning method. The informative content extracted includes relevant text, images, and multimedia, and the web pages are taken online from the Internet. A dataset is also considered for extracting relevant content. The proposed work generates effective rules and achieves automaticity and efficiency.
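The pipeline outlined in the abstract (parse the page into blocks, extract features, learn a rule, then reuse the rule without retraining) can be illustrated with a minimal sketch. It is not the authors' implementation. The two features (`text_len`, `link_density`), the set of block tags, and all function names are assumptions chosen for illustration. To stay within the standard library, a one-level decision tree (a decision stump) stands in for full decision tree classification.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate content blocks.
BLOCK_TAGS = {"div", "p", "nav", "footer", "article", "section", "ul"}
# Assumption: two illustrative features; the paper does not list its features here.
FEATURES = ("text_len", "link_density")


class BlockFeatureParser(HTMLParser):
    """Collects text per innermost block element, tracking text inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished, non-empty blocks in closing order
        self.stack = []    # currently open block elements
        self.in_link = 0   # nesting depth of <a> elements

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "text": "", "link_chars": 0})
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        elif tag in BLOCK_TAGS and self.stack and self.stack[-1]["tag"] == tag:
            block = self.stack.pop()
            block["text"] = block["text"].strip()
            block["text_len"] = len(block["text"])
            if block["text_len"]:  # skip blocks with no direct text
                block["link_density"] = block["link_chars"] / block["text_len"]
                self.blocks.append(block)

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            top = self.stack[-1]
            top["text"] += text + " "
            if self.in_link:
                top["link_chars"] += len(text)


def extract_features(html):
    """Return one feature dict per non-empty block in the page."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def learn_stump(samples, labels):
    """Learn a one-level decision tree: (feature, threshold, informative_if_greater)."""
    best = None
    for feature in FEATURES:
        values = sorted({s[feature] for s in samples})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            for greater in (True, False):
                errors = sum(
                    ((s[feature] > threshold) == greater) != bool(y)
                    for s, y in zip(samples, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, greater)
    if best is None:
        raise ValueError("need at least two distinct feature values to learn a rule")
    return best[1:]


def is_informative(rule, block):
    """Apply a learned (or stored) rule to a single block."""
    feature, threshold, greater = rule
    return (block[feature] > threshold) == greater


def extract_content(html, rule):
    """Reuse a rule on a new page without any retraining."""
    return [b["text"] for b in extract_features(html) if is_informative(rule, b)]


# Hypothetical labelled training page: 1 = informative, 0 = noise.
TRAIN_HTML = """
<html><body>
<nav><a href="/">Home</a> <a href="/news">News</a></nav>
<div><p>The city council approved the new budget on Monday after a long debate.</p></div>
<footer><a href="/about">About us</a> Copyright</footer>
</body></html>
"""
TRAIN_LABELS = [0, 1, 0]  # nav, p, footer (blocks are listed in closing order)

rule = learn_stump(extract_features(TRAIN_HTML), TRAIN_LABELS)
```

Once learned, `rule` is a plain tuple, so it can be stored and later passed to `extract_content` on its own, which mirrors the abstract's idea of reusing automatically generated rules as hand-crafted ones. A real system would need many labelled pages and a richer feature set than this toy example.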
