Abstract

To ensure seamless, programmatic access to data for High Performance Computing (HPC) and analysis across multiple research domains, it is vital to have a methodology for standardization of both data and services. At the Australian National Computational Infrastructure (NCI) we have developed a Data Quality Strategy (DQS) that currently provides processes for: (1) Consistency of data structures needed for a High Performance Data (HPD) platform; (2) Quality Control (QC) through compliance with recognized community standards; (3) Benchmarking cases of operational performance tests; and (4) Quality Assurance (QA) of data through demonstrated functionality and performance across common platforms, tools and services. By implementing the NCI DQS, we have seen progressive improvement in the quality and usefulness of the datasets across the different subject domains, and demonstrated the ease with which modern programmatic methods can be used to access the data, either in situ or via web services, and for uses ranging from traditional analysis methods through to emerging machine learning techniques. To help increase data re-usability by broader communities, particularly in high performance environments, the DQS is also used to identify the need for any extensions to the relevant international standards for interoperability and/or programmatic access.
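
As an illustration of the programmatic, web-service access the abstract refers to, the minimal sketch below opens and subsets a standards-compliant gridded dataset over an OPeNDAP endpoint using Python's xarray library. It is not taken from the paper: the endpoint URL, variable name, and coordinate ranges are hypothetical placeholders.

```python
# Minimal sketch (illustrative only, not the NCI code): programmatic access to a
# standards-compliant NetCDF dataset exposed via an OPeNDAP web service.
# The URL and the variable name "tas" below are hypothetical placeholders.
import xarray as xr

OPENDAP_URL = "https://example.org/thredds/dodsC/sample/temperature.nc"  # hypothetical endpoint

# Open the remote dataset lazily; only the requested subset is transferred.
ds = xr.open_dataset(OPENDAP_URL)

# Subset by coordinate values (possible because the data carry CF-style
# coordinate metadata) and compute a simple statistic over the region.
subset = ds["tas"].sel(time="2000-01", lat=slice(-45, -10), lon=slice(110, 155))
print(float(subset.mean()))
```

The same pattern works for in situ access by passing a local file path instead of a service URL, which is one reason consistent data structures and metadata matter across both access modes.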

Highlights

  • The National Computational Infrastructure (NCI) manages one of Australia’s largest and most diverse repositories (10+ PBytes) of research data collections spanning datasets from climate, coasts, oceans and geophysics through to astronomy, bioinformatics and the social sciences [1]

  • All Quality Control (QC)/Quality Assurance (QA) reports and benchmarks are shared with the data producers

  • The NCI Data Quality Strategy (DQS) has been applied to Climate and Weather, Earth Observation, Geoscience and Astronomy data with the QC and QA tests adapted to meet the relevant community standards and protocols for each domain

Introduction

The National Computational Infrastructure (NCI) manages one of Australia’s largest and most diverse repositories (10+ PBytes) of research data collections spanning datasets from climate, coasts, oceans and geophysics through to astronomy, bioinformatics and the social sciences [1]. Within these domains, data can be gridded, ungridded (e.g., line surveys, point clouds), or raster imagery, with diverse coordinate reference systems, projections and resolutions. A set of standards and ‘best practices’ for ensuring the quality of scientific data products is a critical component in the life cycle of data management.
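
To make the QC notion concrete, the sketch below (an assumption on our part, not the NCI DQS tooling) checks a NetCDF file for a few of the metadata elements that community conventions such as CF and ACDD expect a compliance checker to find; the file path and the attribute list are illustrative only.

```python
# Illustrative sketch only: a minimal QC-style metadata check on a NetCDF file.
# The required-attribute list and the file name are hypothetical examples.
from netCDF4 import Dataset

REQUIRED_GLOBAL_ATTRS = ["Conventions", "title", "summary", "license"]

def basic_metadata_check(path):
    """Report expected global attributes and per-variable units that are missing."""
    problems = []
    with Dataset(path) as nc:
        for attr in REQUIRED_GLOBAL_ATTRS:
            if attr not in nc.ncattrs():
                problems.append(f"missing global attribute: {attr}")
        for var_name, var in nc.variables.items():
            # CF-style data variables normally declare their units;
            # some variables (e.g., flag variables) legitimately may not.
            if "units" not in var.ncattrs():
                problems.append(f"variable '{var_name}' has no units attribute")
    return problems

if __name__ == "__main__":
    for issue in basic_metadata_check("example_dataset.nc"):  # hypothetical file
        print(issue)
```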
