Abstract
Context: Web page segmentation methods have been used for purposes such as web page classification and content analysis. These methods divide a web page into blocks, where each block contains similar components. Objective: The goal of this paper is to propose a new segmentation approach that semantically segments web pages into integrated blocks with high segmentation accuracy. Method: We propose a segmentation model that (1) merges web page content into basic-blocks by simulating human perception using the Gestalt laws of grouping, and (2) uses semantic text similarity to identify similar basic-blocks and regroup them as integrated blocks. Results: To verify the accuracy of our approach, we (1) applied it to three datasets and (2) compared it with five existing state-of-the-art algorithms. The results show that our approach outperforms all five comparison methods in terms of precision, recall, F-1 score, and adjusted Rand index (ARI). Conclusion: The proposed model, applied to three datasets, (1) generates basic-blocks by simulating human perception to segment a web page, (2) identifies semantically related blocks and regroups them as integrated blocks, and (3) addresses limitations found in existing approaches.
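The regrouping step described under Method can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the similarity measure or the grouping rule, so the sketch substitutes Jaccard token overlap for semantic text similarity and merges any pair of basic-blocks whose score meets a threshold. The names `jaccard` and `regroup`, the threshold value, and the sample block texts are all hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a semantic similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def regroup(blocks: list[str], threshold: float = 0.3) -> list[list[int]]:
    """Merge basic-blocks whose pairwise similarity reaches the threshold.

    Returns integrated blocks as sorted lists of basic-block indices.
    The threshold is an assumed value, not one taken from the paper.
    """
    parent = list(range(len(blocks)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(blocks)), 2):
        if jaccard(blocks[i], blocks[j]) >= threshold:
            parent[find(j)] = find(i)

    groups: dict[int, list[int]] = {}
    for i in range(len(blocks)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical basic-block texts, as might come out of the first stage.
basic_blocks = [
    "latest sports news today",
    "sports news and scores today",
    "contact us by email",
    "email us to contact support",
]
integrated_blocks = regroup(basic_blocks)
```

With these sample texts, the two news blocks end up in one integrated block and the two contact blocks in another. A real system would replace `jaccard` with a semantic measure, since token overlap misses synonyms and paraphrases.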