Abstract

Understanding the past, present, and changing behavior of the climate requires close collaboration among a large number of researchers from many scientific domains. At present, the necessary interdisciplinary collaboration is severely limited by difficulties in discovering, sharing, and integrating climate data, difficulties that grow with the rapidly increasing data volume. This paper discusses methods and techniques for solving the inter-related problems encountered when transmitting, processing, and serving metadata for heterogeneous Earth System Observation and Modeling (ESOM) data. A cyberinfrastructure-based solution is proposed to enable effective cataloging and two-step search on big climatic datasets by leveraging state-of-the-art web service technologies and crawling the existing data centers. To validate its feasibility, the big dataset served by the UCAR THREDDS Data Server (TDS), which provides petabyte-scale ESOM data and updates hundreds of terabytes of data every day, is used as the case study. A complete workflow is designed to analyze the metadata structure in TDS and create an index for data parameters. A simplified registration model is constructed which defines constant information, delimits secondary information, and exploits spatial and temporal coherence in the metadata. The model derives a sampling strategy for a high-performance concurrent web crawler bot, which mirrors the essential metadata of the big data archive without overwhelming network and computing resources. The metadata model, crawler, and standards-compliant catalog service together form an incremental search cyberinfrastructure, allowing scientists to search big climatic datasets in near real-time.
The proposed approach has been tested on the UCAR TDS, and the results show that it achieves its design goals: crawling speed is boosted by at least a factor of ten, and redundant metadata is reduced from 1.85 gigabytes to 2.2 megabytes. This is a significant step toward making currently non-searchable climate data servers searchable.
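The crawling step described above walks the linked catalogs exposed by a THREDDS Data Server, harvesting dataset entries while skipping catalogs it has already visited. The following is a minimal sketch of that traversal in Python; the function names and the breadth-first, single-threaded loop are illustrative simplifications of the paper's concurrent crawler, and the `fetch` callback stands in for a real HTTP request. Only the THREDDS InvCatalog 1.0 and XLink namespace URIs are taken from the actual catalog format.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Namespace URIs used by THREDDS InvCatalog 1.0 documents.
TDS_NS = "{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}"
XLINK_NS = "{http://www.w3.org/1999/xlink}"

def parse_catalog(xml_text):
    """Extract accessible dataset names and child-catalog links from one catalog."""
    root = ET.fromstring(xml_text)
    # Datasets with a urlPath are directly accessible data entries.
    datasets = [d.get("name") for d in root.iter(f"{TDS_NS}dataset")
                if d.get("urlPath")]
    # catalogRef elements point to nested catalogs to crawl next.
    refs = [r.get(f"{XLINK_NS}href") for r in root.iter(f"{TDS_NS}catalogRef")]
    return datasets, refs

def crawl(start_url, fetch):
    """Breadth-first crawl of linked catalogs, deduplicating visited URLs."""
    seen, queue, index = set(), deque([start_url]), []
    while queue:
        url = queue.popleft()
        if url in seen:          # avoid re-crawling shared sub-catalogs
            continue
        seen.add(url)
        datasets, refs = parse_catalog(fetch(url))
        index.extend(datasets)   # accumulate the metadata index
        queue.extend(refs)
    return index

# Demo with two in-memory catalogs instead of live HTTP fetches.
CATALOGS = {
    "root.xml": (
        '<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" '
        'xmlns:xlink="http://www.w3.org/1999/xlink">'
        '<catalogRef xlink:href="child.xml"/></catalog>'
    ),
    "child.xml": (
        '<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">'
        '<dataset name="gfs.t00z.grib2" urlPath="gfs/gfs.t00z.grib2"/></catalog>'
    ),
}
print(crawl("root.xml", CATALOGS.get))  # ['gfs.t00z.grib2']
```

In practice the `seen` set is what keeps the mirror from overwhelming the server: shared sub-catalogs are fetched once, and the sampling strategy from the registration model would further prune which branches are re-visited on incremental updates.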

Highlights

  • Cyberinfrastructure plays an important role in today’s climate research activities [1,2,3,4,5,6]

  • We extended CyberConnector to support accessing metadata harvested and indexed by the thredds-crawler described in the previous section

  • The searching capabilities on the two datasets were established in the EarthCube CyberConnector


Introduction

Cyberinfrastructure plays an important role in today’s climate research activities [1,2,3,4,5,6]. The big data challenges of volume, velocity, variety, veracity, and value (5Vs) have pushed geoscientific research toward a more collaborative endeavor involving many observational data providers, cyberinfrastructure developers, modelers, and information stakeholders [9]. Most scientists acquire their knowledge about datasets via conferences, colleague recommendations, textbooks, and search engines. They become very familiar with the datasets they use, and every time they want to retrieve the data, they go directly to the dataset website to download the data falling within the requested temporal and spatial windows. These routines become less sustainable as sensors and datasets grow more varied, models evolve more frequently, and new data pertaining to their research becomes available elsewhere [9]. The key difference is that the bibliographic approach works with distinct information entities of limited types, while the data management approach works with models of data/information structures and their relationships.

