Abstract

Data is the crux of science. The widespread availability of big data today is of particular importance for fostering new forms of geospatial innovation. This article reports a state-of-the-art solution that addresses a key cyberinfrastructure research problem—providing ready access to big, distributed geospatial data resources on the Web. I first formulate this data access problem and introduce its indispensable elements, including identifying the cyberlocation, space–time coverage, theme, and quality of the data set. I then propose strategies to tackle each data access issue and make the data more discoverable and usable for geospatial data users and decision makers. Among these strategies is large-scale Web crawling, a key technique to support automatic collection of online geospatial data that are highly distributed, intrinsically heterogeneous, and known to be dynamic. To better understand the content and scientific meaning of the data, methods including space–time filtering, ontology-based thematic classification, and service quality evaluation are incorporated. To serve a broad scientific user community, these techniques are integrated into an operational data crawling system, PolarHub, which is also an important cyberinfrastructure building block to support effective data discovery. A series of experiments was conducted to demonstrate the performance of the PolarHub system. This work contributes to building the theoretical and methodological foundation for data-driven geography and the emerging field of spatial data science.
