Abstract

Commercial web pages contain several blocks of information. Apart from the core content blocks, there are subsidiary blocks such as privacy notices, advertisements, navigation panels and copyright notices. Such blocks are called noise blocks. The information in noise blocks can degrade the performance of Information Retrieval, and eliminating this noise is a significant challenge. This paper aims to extract the vital information from web pages by eliminating noise. Once the content has been extracted, it is cleaned and presented in a standard format. We propose a system in which content is extracted from web pages using an unsupervised technique. The system uses web page segmentation to partition the page into distinct visual blocks. Statistical clustering methods are then used to classify the visual blocks based on their features. The content relevant to the search is filtered from the web page, and the user is presented with a page free of noise. Our approach concentrates on eliminating Crucial Noises, Auxiliary Content and Noise Content based on block importance. The importance of every block is computed using the Hybrid hash algorithm. Appropriate blocks are then selected with respect to a threshold value using the Enhanced-Sketching algorithm, which makes the web page suitable for effective Information Retrieval.
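
The abstract does not define the Hybrid hash or Enhanced-Sketching algorithms, so the following is only a minimal sketch of the final idea: score each block, then keep the blocks that meet a threshold. Everything in it is an assumption made for illustration: the `Block` type, the `importance` function (a simple link-density heuristic standing in for the paper's importance measure) and the threshold of 0.5. It also assumes that segmentation has already produced the blocks.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One visual block from a page-segmentation step (assumed to be given)."""
    label: str
    text: str
    link_chars: int  # number of characters of `text` that sit inside hyperlinks


def importance(block: Block) -> float:
    """Placeholder score: share of the block's text that is NOT link text.

    This is a hypothetical stand-in, not the Hybrid hash algorithm.
    """
    total = len(block.text)
    if total == 0:
        return 0.0
    return 1.0 - min(block.link_chars, total) / total


def select_blocks(blocks, threshold):
    """Keep blocks whose score meets the threshold.

    A plain filter used for illustration, not the Enhanced-Sketching algorithm.
    """
    return [b for b in blocks if importance(b) >= threshold]


nav_text = "Home News Sports Contact"
ad_text = "Buy now and save"
blocks = [
    Block("navigation", nav_text, link_chars=len(nav_text)),  # all link text
    Block("article", "Body text of the main story.", link_chars=0),
    Block("advertisement", ad_text, link_chars=len(ad_text)),  # all link text
]

kept = select_blocks(blocks, threshold=0.5)
print([b.label for b in kept])
```

In this toy input the navigation and advertisement blocks consist entirely of link text and fall below the threshold, while the article block is kept. A real system would replace `importance` with the block-importance computation described in the paper.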
