Abstract

The Internet has gained wide acceptance as a reservoir of information. A web page typically contains noise (advertisements, external links) alongside its main content, which makes it difficult for search engine crawlers to classify the page correctly and distracts users who are interested in gathering relevant data. In this paper, we propose a novel approach that identifies the relevant content of a web page and uses this information to filter and rearrange the page's content. We use a web page segmentation algorithm to parse the page into non-overlapping visual blocks and then extract features from these blocks to build the dataset. Classifiers are trained on the dataset using popular machine learning techniques (neural network, RBF neural network) to discriminate main content from noise. Finally, the classification output is used to perform main content filtering of the web page. We also analyse the importance of the features to the learning process and observe that embedded objects from external sources have the highest significance for block identification.
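
The pipeline described above (visual blocks, per-block features, a trained classifier, then filtering) can be illustrated with a minimal sketch. It is not the paper's implementation: the segmentation step is skipped and blocks are given directly, the three features and all block values are hypothetical, and a single-neuron perceptron stands in for the neural network and RBF neural network classifiers named in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A visual block with hypothetical, hand-picked raw measurements."""
    name: str
    text_chars: int       # characters of visible text in the block
    link_chars: int       # characters of that text that sit inside links
    external_embeds: int  # embedded objects loaded from an external source


def features(block):
    """Map a block to a small feature vector with values in [0, 1]."""
    text_size = min(block.text_chars, 1000) / 1000
    link_density = block.link_chars / max(block.text_chars, 1)
    has_external = 1.0 if block.external_embeds > 0 else 0.0
    return [text_size, link_density, has_external]


def score(weights, bias, x):
    """Weighted sum of the features plus bias; positive means 'content'."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train_perceptron(samples, max_epochs=100):
    """Train on (feature_vector, label) pairs; label +1 = content, -1 = noise."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in samples:
            if label * score(weights, bias, x) <= 0:  # misclassified block
                weights = [w + label * xi for w, xi in zip(weights, x)]
                bias += label
                mistakes += 1
        if mistakes == 0:  # every training block is classified correctly
            break
    return weights, bias


def filter_main_content(blocks, weights, bias):
    """Keep only the blocks the classifier labels as main content."""
    return [b for b in blocks if score(weights, bias, features(b)) > 0]


# Toy, made-up training data (assumed for illustration, not from the paper).
content_blocks = [
    Block("article-body", 1200, 60, 0),
    Block("article-intro", 800, 20, 0),
    Block("article-section", 1500, 90, 0),
    Block("article-summary", 600, 30, 0),
]
noise_blocks = [
    Block("ad-banner", 40, 40, 1),
    Block("sponsored-links", 60, 55, 2),
    Block("tracking-widget", 20, 20, 1),
    Block("nav-menu", 100, 90, 0),
]
samples = [(features(b), 1) for b in content_blocks]
samples += [(features(b), -1) for b in noise_blocks]

weights, bias = train_perceptron(samples)

# Filter a mixed "page" down to the blocks classified as main content.
page = noise_blocks[:2] + content_blocks[:2]
kept = filter_main_content(page, weights, bias)
print([b.name for b in kept])
```

The learned weights can be printed to see how strongly each toy feature pushes a block towards content or noise, but with made-up data they say nothing about the feature importance reported in the paper.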
