Abstract

One of the critical problems in the curation of research data is the harmonization of its internal metadata schemata. The value of harmonizing such data is well illustrated by the Berkeley Earth project, which successfully integrated into one metadata schema raw climate datasets from a wide variety of geographical sources and time periods (spanning 250 years). Doing so enabled climate scientists to calculate a more accurate estimate of the recent changes in Earth’s average land surface temperatures and to ascertain the extent to which climate change is anthropogenic. This paper surveys some of the approaches that have been taken to the integration of data schemata in general and examines some of the specific metadata features of the source surface temperature datasets that were harmonized by Berkeley Earth. The conclusion drawn from this analysis is that the original source data and the Berkeley Earth common format provide a promising training set on which to apply machine learning methods for replicating the human data integration process. This paper describes research in progress on a domain-independent approach to the metadata harmonization problem that could be applied to other fields of study and be incorporated into a data portal to enhance the discoverability and reuse of data from a broad range of data sources.

Highlights

  • One of the critical features of a research data set is the metadata schema, sometimes referred to as the data format, that specifies the semantics for its data points

  • Some data obtained by researchers in one discipline, such as ecology, may be relevant to another discipline, such as climatology

  • This paper argues for the value and the feasibility of a machine-learning approach for addressing the data harmonization problem

Introduction

One of the critical features of a research data set is the metadata schema, sometimes referred to as the data format, that specifies the semantics for its data points. Given a suitably constructed ontology in a specific domain (e.g., the CIDOC Conceptual Reference Model, which provides a common semantic framework for cultural heritage information), it is possible to develop rule-based algorithms to generate candidate crosswalks between schemata (Gaitanou et al., 2012). This approach is only effective if the mediating translation schema is an adequate abstraction of the subject domain. Database schema matching systems and ontology integration systems typically rely on known source and target schemas, applying linguistic and rule-based approaches to perform the mapping. Neither of these strategies generalizes well in legacy data documentation environments whose interpretation is highly dependent on the software designed to read it, as is typically the case with climate datasets. The fact that multiple crosswalks have already been written for the same target metadata schema affords the opportunity to automate the mapping with machine learning methods.
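To make the contrast concrete, the sketch below (in Python) illustrates the linguistic matching baseline described above: proposing candidate crosswalk entries by scoring string similarity between source and target field names. The field names and threshold are hypothetical illustrations, not drawn from the Berkeley Earth datasets.

# A minimal sketch of a linguistic schema-matching baseline: propose
# candidate crosswalk entries by scoring string similarity between
# source and target field names. All names here are hypothetical.
from difflib import SequenceMatcher

def candidate_crosswalk(source_fields, target_fields, threshold=0.5):
    """Return (source, target, score) triples whose field-name
    similarity meets the threshold, strongest candidates first."""
    candidates = []
    for s in source_fields:
        for t in target_fields:
            score = SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if score >= threshold:
                candidates.append((s, t, score))
    return sorted(candidates, key=lambda c: -c[2])

# Hypothetical station-record schemata:
source = ["stn_id", "lat", "lon", "tavg_c", "obs_date"]
target = ["station_id", "latitude", "longitude", "mean_temperature", "date"]
for s, t, score in candidate_crosswalk(source, target):
    print(f"{s} -> {t} ({score:.2f})")

On these hypothetical names the baseline recovers stn_id -> station_id but also proposes the spurious lat -> date and misses tavg_c -> mean_temperature entirely; closing both kinds of gap is precisely what supervision from existing human-written crosswalks could provide to a learned matcher.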

A Machine Learning Approach to Mapping Schemata
Conclusions and Future Work