IHWC: intelligent hidden web crawler for harvesting data in urban domains

Sawroop Kaur, Aman Singh, G Geetha, Xiaochun Cheng

doi:10.1007/s40747-021-00471-1


Open Access

https://doi.org/10.1007/s40747-021-00471-1


Abstract

Due to the massive size of the hidden web, searching, retrieving and mining rich, high-quality data can be a daunting task. Moreover, because much of this data sits behind forms, it cannot be accessed easily. Forms are dynamic, heterogeneous and spread over trillions of web pages. Significant efforts have addressed the problem of tapping into the hidden web to integrate and mine rich data. Effective techniques, as well as their application in special cases, still need to be explored to achieve an effective harvest rate. One such special area is atmospheric science, where hidden web crawling is least implemented and a crawler must work through the huge web to narrow the search down to specific data. In this study, an intelligent hidden web crawler for harvesting data in urban domains (IHWC) is implemented to address related problems such as classification of domains, prevention of exhaustive searching, and prioritization of URLs. The crawler also performs well in curating pollution-related data. It targets relevant web pages and discards irrelevant ones by applying rejection rules. To achieve more accurate results for a focused crawl, IHWC crawls websites in priority order for a given topic. The crawler fulfils the dual objective of providing an effective hidden web crawler that can focus on diverse domains and of checking its integration in searching pollution data in smart cities. One of the objectives of smart cities is to reduce pollution. The crawled data can be used to find the causes of pollution, and the crawler can help a user look up the level of pollution in a specific area. The harvest rate of the crawler is compared with that of pioneering existing work. With an increase in the size of the dataset, the presented crawler can add significant value to emission accuracy. Our results demonstrate the accuracy and harvest rate of the proposed framework: it efficiently collects hidden web interfaces from large-scale sites and achieves higher rates than other crawlers.
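The abstract mentions rejection rules, URL prioritization and harvest rate without giving details. The following is a minimal sketch of how these three pieces could fit together, not the authors' implementation. The rejected file suffixes, the `Frontier` class and the scoring values are hypothetical, and `harvest_rate` assumes the common focused-crawling definition (relevant pages found divided by pages visited), which this page does not spell out.

```python
import heapq
from urllib.parse import urlparse

# Hypothetical rejection rule: skip file types assumed unlikely to hold search forms.
REJECTED_SUFFIXES = (".jpg", ".png", ".pdf", ".zip", ".css", ".js")


def is_rejected(url):
    """Return True when the URL path ends with a rejected suffix."""
    path = urlparse(url).path.lower()
    return path.endswith(REJECTED_SUFFIXES)


class Frontier:
    """Priority queue of URLs; the highest-scoring URL is crawled first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker that keeps insertion order for equal scores

    def push(self, url, score):
        """Queue a URL unless it was already seen or matches a rejection rule."""
        if url in self._seen or is_rejected(url):
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Remove and return the URL with the highest score."""
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


def harvest_rate(relevant_pages, total_pages):
    """Fraction of visited pages that were relevant (assumed definition)."""
    return relevant_pages / total_pages if total_pages else 0.0
```

In this sketch the relevance score is supplied by the caller; in a focused crawler it would typically come from a classifier applied to the link or its parent page.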

Highlights

Smart cities are the essence of new age comfortable living in urban areas such as towns and cities
The page classification is based on the similarity index between the web page extracted by the crawler and the seed pages of a specific domain
Feature space for a hidden web site is based on URL, anchor and text around the anchor
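The two technical highlights above can be illustrated with a short sketch. It assumes cosine similarity over raw term counts as the similarity index, which this page does not confirm, and all function names, the seed texts and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def link_features(url, anchor, around):
    """Build a term-count vector from the URL, the anchor and the text around it."""
    return Counter(tokenize(url) + tokenize(anchor) + tokenize(around))


def cosine(a, b):
    """Cosine similarity between two term-count vectors (assumed similarity index)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(page_text, seeds_by_domain, threshold=0.2):
    """Assign the page to the domain of its most similar seed page.

    Returns (None, best_score) when no seed reaches the hypothetical threshold.
    """
    page_vec = Counter(tokenize(page_text))
    best_domain, best_score = None, 0.0
    for domain, seed_texts in seeds_by_domain.items():
        for seed_text in seed_texts:
            score = cosine(page_vec, Counter(tokenize(seed_text)))
            if score > best_score:
                best_domain, best_score = domain, score
    if best_score < threshold:
        return None, best_score
    return best_domain, best_score


# Usage with made-up seed pages for two domains.
seeds = {
    "pollution": ["air quality index pollution monitoring station"],
    "books": ["book author title publisher isbn"],
}
domain, score = classify("search air pollution levels by monitoring station", seeds)
```

Raw term counts keep the example dependency-free; a weighting scheme such as TF-IDF could be substituted without changing the structure.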

Summary

Introduction

Smart cities are the essence of new-age comfortable living in urban areas such as towns and cities. Their development pursues certain objectives, one of which is reducing air pollution and enabling better area-based development. Implementing solutions for this objective requires crawled data. The objective of this study is to implement intelligent, location-aware hidden web crawling focused on urban pollution data. A supervision-based hidden web crawler is developed for collecting data, and it is applied both to hidden web domains and to crawling pollution data from the web. Of the numerous ways to collect data, web search is one of the most used. It is claimed that 85% of the users rely on search



Journal: Complex & Intelligent Systems	Publication Date: Jul 24, 2021
Citations: 1	License type: CC BY 4.0


Similar Papers

A Survey on Content Based Crawling for Deep and Surface Web
Nishchay Agrawal ... Suchi Johari
01 Nov 2019

Improving the freshness of the search engines by a probabilistic approach based incremental crawler
G Pavai ... T V Geetha
Information Systems Frontiers | VOL. 19
15 Sep 2016

Addressing big data challenges in smart cities: a systematic literature review
Sumedha Chauhan ... Arpan Kumar Kar
info | VOL. 18
13 Jun 2016

VITALIZED BI-LEVEL WEB CRAWLER FOR REMOVAL OF REDUNDANT CONTENT IN DEEP WEB INTERFACE
Supriya.H.S
International Journal of Research in Engineering and Technology | VOL. 05
25 May 2016
