Abstract

With the advance of World-Wide Web (WWW) technology, people can easily share content on the Web, including geospatial data and web services. As a result, “big geospatial data management” issues have started to attract attention. Among these issues, this research focuses on discovering distributed geospatial resources. Because resources are scattered across the WWW, users cannot efficiently find the resources they are interested in. Just as the WWW has Web search engines to address web resource discovery, we envision that the geospatial Web (i.e., GeoWeb) also requires GeoWeb search engines. To realize a GeoWeb search engine, one of the first steps is to proactively discover GeoWeb resources on the WWW. Hence, in this study, we propose the GeoWeb Crawler, an extensible Web crawling framework that can find various types of GeoWeb resources, such as Open Geospatial Consortium (OGC) web services, Keyhole Markup Language (KML) files, and Environmental Systems Research Institute, Inc. (ESRI) Shapefiles. In addition, we apply the distributed computing concept to improve the performance of the GeoWeb Crawler. The results show that, for 10 targeted resource types, the GeoWeb Crawler discovered 7351 geospatial services and 194,003 datasets. These results indicate that the proposed GeoWeb Crawler framework is extensible and scalable, and that it can provide a comprehensive index of the GeoWeb.
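To make the idea of finding several resource types more concrete, the following is a minimal sketch, not the authors' implementation, of how a crawler might pull plain-text URLs out of a page with a regular expression and then guess the resource type of each one. The pattern, the extension table, the set of OGC service names, and the function names `extract_urls` and `classify` are all assumptions introduced for illustration. The only external facts relied on are that KML files commonly use the `.kml` extension, that the main file of a Shapefile uses `.shp`, and that OGC key-value-pair requests carry a `service` parameter.

```python
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical pattern: any http(s) URL up to whitespace, quotes, or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")

# Hypothetical lookup tables; a real crawler would need more types and stronger checks.
EXTENSION_TYPES = {".kml": "KML", ".shp": "ESRI Shapefile"}
OGC_SERVICES = {"WMS", "WFS", "WCS", "WMTS", "CSW", "SOS", "WPS"}


def extract_urls(text):
    """Return the URLs found in plain text, trimming trailing punctuation."""
    return [match.rstrip(".,;)") for match in URL_PATTERN.findall(text)]


def classify(url):
    """Guess a resource type from the URL alone; return None if unknown."""
    parsed = urlparse(url)
    path = parsed.path.lower()
    for extension, kind in EXTENSION_TYPES.items():
        if path.endswith(extension):
            return kind
    # OGC key-value-pair requests name the service in a case-insensitive parameter.
    params = {key.lower(): values for key, values in parse_qs(parsed.query).items()}
    service = params.get("service", [""])[0].upper()
    if service in OGC_SERVICES:
        return "OGC " + service
    return None


if __name__ == "__main__":
    sample = (
        "See http://example.org/data/roads.kml, and "
        "http://example.org/ows?SERVICE=WMS&REQUEST=GetCapabilities."
    )
    for found in extract_urls(sample):
        print(found, "->", classify(found))
```

A URL-based guess like this is only a first filter. Confirming that a candidate really is a geospatial resource would normally require fetching it and inspecting the response, which this sketch does not attempt.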

Highlights

  • According to [6], GeoWeb is based on a framework of open standards and standards-based technologies, such as geospatial services and Spatial Data Infrastructure (SDI) [8]

  • This study mainly focuses on addressing the issue of GeoWeb resource discovery

  • These problems make GeoWeb resource discovery inefficient

Summary

Big Geospatial Data

Geospatial data are used to support decision making, design, and statistics in various fields, such as administration, financial analysis, and scientific research [1,2,3,4]. With the advance of the World-Wide Web (WWW), Web 2.0 represents a concept that allows everyone to set up websites or post content on the Web [5]. Users can share their data or services on different Web 2.0 platforms, such as Facebook, Wikipedia, and YouTube. Data on the Web are generated at a high rate (velocity), require large amounts of storage (volume), and come in various types (variety), such as text, image, and video. The sensor observations in GeoWeb [11] are produced by a large number of sensor nodes that monitor various kinds of phenomena, such as temperature, humidity, and air pressure, and they vary widely in data formats, web protocols, and semantic frameworks.

Problems in Geospatial Resources Discovery
Existing Approaches for GeoWeb Resource Discovery
Methodology
Workflow
Discovering Plain-Text URLs with Regular Expressions
Identification Mechanisms for Different Geospatial Resources
Distributed Crawling Process for Scalability
Crawling Latency Comparison between Standalone and Parallel Processing
GeoHub—A GeoWeb Search Engine Prototype