INSTANCE – the Italian seismic dataset for machine learning

Alberto Michelini, Spina Cianetti, Carlo Giunchi, Dario Jozinović, Valentino Lauciani, Sonja Gaviano

doi:10.5194/essd-13-5509-2021

Abstract

The Italian earthquake waveform data are collected here in a dataset suited for machine learning (ML) applications. The dataset consists of nearly 1.2 million three-component (3C) waveform traces from about 50 000 earthquakes and more than 130 000 noise 3C waveform traces, for a total of about 43 000 h of data and an average of 21 3C traces provided per event. The earthquake list is based on the Italian Seismic Bulletin (http://terremoti.ingv.it/bsi, last access: 15 February 2020) of the Istituto Nazionale di Geofisica e Vulcanologia between January 2005 and January 2020, and it includes events in the magnitude range between 0.0 and 6.5. The waveform data were recorded primarily by the Italian National Seismic Network (network code IV) and include both weak-motion (HH, EH channels) and strong-motion (HN channels) recordings. All the waveform traces have a length of 120 s, are sampled at 100 Hz, and are provided both in counts and in ground motion physical units after deconvolution of the instrument transfer functions. The waveform dataset is accompanied by metadata consisting of more than 100 parameters providing comprehensive information on the earthquake source, the recording stations, the trace features, and other derived quantities. This rich set of metadata allows users to tailor the data selection to their own purposes. Many of these metadata can be used as labels in ML analysis or for other studies. The dataset, assembled in HDF5 format, is available at http://doi.org/10.13127/instance (Michelini et al., 2021).
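The figures quoted in the abstract fix the size of each trace and can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the dataset's own tooling: the function names are hypothetical, the trace counts are the rounded values from the abstract, and the (components, samples) layout is an assumption for illustration rather than a statement about how the HDF5 file is organized.

```python
# Values quoted in the abstract. The trace counts are rounded there
# ("nearly 1.2 million", "more than 130 000"), so totals are approximate.
TRACE_LENGTH_S = 120
SAMPLING_RATE_HZ = 100
N_COMPONENTS = 3
N_EVENT_TRACES = 1_200_000
N_NOISE_TRACES = 130_000


def samples_per_component(length_s=TRACE_LENGTH_S, rate_hz=SAMPLING_RATE_HZ):
    """Number of samples in one component of one trace."""
    return length_s * rate_hz


def trace_shape(n_components=N_COMPONENTS):
    """Assumed (components, samples) layout of a single 3C trace."""
    return (n_components, samples_per_component())


def total_hours(n_traces, length_s=TRACE_LENGTH_S):
    """Total recording time, in hours, covered by n_traces 3C traces."""
    return n_traces * length_s / 3600


if __name__ == "__main__":
    print("samples per component:", samples_per_component())
    print("assumed trace shape:", trace_shape())
    hours = total_hours(N_EVENT_TRACES + N_NOISE_TRACES)
    print(f"approximate total duration: {hours:.0f} h")
```

With the rounded counts, the total comes out slightly above the "about 43 000 h" stated in the abstract, which is expected because "nearly 1.2 million" overstates the exact number of earthquake traces.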

Highlights

Important breakthroughs in the understanding of earthquake phenomena can be achieved through the analysis of the very large number of continuous waveform recordings stored in the existing seismic archives.
The use of sophisticated and optimized machine learning (ML) algorithms for the analysis of large amounts of seismic data can lead to remarkable improvements for automated tasks like seismic waveform onset picking, ground motion prediction, and earthquake early warning; for the detection of hidden signals currently recognized as noise; or for novel modeling and inversion strategies.
Zhu et al. (2019), Mousavi et al. (2020), and Mousavi and Beroza (2020) are excellent examples of successful applications of ML that can substantially improve the earthquake detection level relative to most traditional methods, leading to the location of tiny, previously undetected earthquakes and improving our knowledge of the heterogeneity of stress release on known and unknown faults.

Summary

Introduction

Important breakthroughs in the understanding of earthquake phenomena can be achieved through the analysis of the very large number of continuous waveform recordings stored in the existing seismic archives. To this end, it can be important to make available well-organized, representative subsets of the archives together with their associated metadata. The introduction of competitions like those for predicting laboratory earthquakes launched on the Kaggle platform (https://www.kaggle.com/c/LANL-Earthquake-Prediction/data, last access: 19 November 2021) or the SeismOlympics (Fang et al., 2017), which attracted several thousand teams, further demonstrates the great potential of benchmark datasets (Johnson et al., 2021) and the general interest in tackling seismology problems with ML.

Objectives

Results

Discussion

Conclusion