Abstract

Previous work shows that a web page can be partitioned into multiple segments, or blocks, and that these blocks are often not equally important within the page. It has also been shown that separating noisy and unimportant blocks from the rest of a page can facilitate web mining, search, and accessibility. However, no uniform approach or model has been presented for measuring the importance of the different blocks in a web page. Through a user study, we found that people do have a consistent view of the importance of blocks in a web page. We therefore investigate how to build a model that automatically assigns importance values to the blocks of a web page, and we formulate block importance estimation as a learning problem. First, we use a vision-based page segmentation technique to partition a web page into semantic blocks with a hierarchical structure. Next, spatial features (such as position and size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. Finally, learning algorithms are used to train a model that assigns an importance value to each block in the page. In our experiments, the best model achieves a Micro-F1 of 80.2% and a Micro-Accuracy of 86.8%.
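To make the feature-extraction step concrete, the following is a minimal sketch of how a per-block feature vector might be assembled from spatial and content features. The `Block` type, the `block_features` function, the choice of exactly five features, and the normalization by page size are all illustrative assumptions, not the feature set or method used in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Hypothetical block produced by a page segmenter (assumed fields)."""
    x: float          # left edge, in pixels
    y: float          # top edge, in pixels
    width: float      # block width, in pixels
    height: float     # block height, in pixels
    num_images: int   # content feature: number of images in the block
    num_links: int    # content feature: number of links in the block


def block_features(block: Block, page_width: float, page_height: float) -> List[float]:
    """Build a feature vector from spatial and content features.

    Spatial features are normalized by page size (an assumption made here
    so that blocks from pages of different dimensions are comparable).
    """
    center_x = (block.x + block.width / 2) / page_width
    center_y = (block.y + block.height / 2) / page_height
    relative_area = (block.width * block.height) / (page_width * page_height)
    return [
        center_x,
        center_y,
        relative_area,
        float(block.num_images),
        float(block.num_links),
    ]


# Example: a large block near the top center of a 1000 x 2000 pixel page.
main_block = Block(x=200, y=100, width=600, height=800, num_images=2, num_links=5)
features = block_features(main_block, page_width=1000, page_height=2000)
```

A vector of this kind, paired with an importance label from the user study, would form one training example for whichever learning algorithm is chosen.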
