Abstract
The Web is the communication platform and source of information par excellence. The volume and complexity of its content have grown enormously, making the organization, retrieval, and cleaning of Web information a challenge for traditional techniques. Web intelligence is a novel research area that aims to improve Web-based services and applications using artificial intelligence and machine learning algorithms, for which a large amount of Web-related data is essential. Current datasets are, however, limited and do not combine the visual representation of Web pages with their attributes. Our work provides a large dataset of 49,438 Web pages, composed of webshots along with qualitative and quantitative attributes. This dataset covers all the countries in the world and a wide range of topics, such as art, entertainment, economics, business, education, government, news, media, science, and the environment, addressing different cultural characteristics and varied design preferences. We use this dataset to develop three Web intelligence applications: knowledge extraction on Web design using statistical analysis; recognition of error Web pages using a customized convolutional neural network (CNN) to eliminate invalid pages; and Web categorization based solely on screenshots using a CNN with transfer learning to assist search engines, indexers, and Web directories.