Abstract
Text analytics based on supervised machine learning has shown great promise in a multitude of domains but has yet to be applied to seismology. We describe some common classifiers (Naïve Bayes, k-Nearest Neighbors, Support Vector Machines, and Random Forests) as well as the standard steps of supervised learning (training, validation of model parameter adjustments, and testing). To illustrate text classification on a seismological corpus, we use one hundred articles on the topic of precursory accelerating seismicity, published between 1988 and 2010. This corpus was labelled by Mignan [Tectonophysics, 2011] according to whether the precursor is explained by critical processes (i.e., cascade triggering) or by other processes (such as a signature of main fault loading). We investigate how the classification process can be automated to help analyze larger corpora in order to better understand trends in earthquake predictability research. We find that the Naïve Bayes model performs best, in agreement with the machine learning literature for the case of small datasets, with cross-validation accuracies showing the model’s predictive ability for both binary classification (“critical process” or not) and multiclass classification (“non-critical process,” “agnostic,” “critical process assumed,” “critical process demonstrated”). Prediction on a dozen articles published since 2011, however, shows weak generalization, which can be explained, in part, by the empirical variance of the small training set. This preliminary study demonstrates the potential of supervised learning to reveal textual patterns in the seismological literature. Manual labelling remains essential but is made transparent by an investigation of Naïve Bayes keyword posterior probabilities.
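To make the workflow concrete, the following is a minimal sketch of binary text classification with a multinomial Naïve Bayes model and leave-one-out cross-validation, written in plain Python. It is not the implementation used in the study: the four example sentences, their labels, and the function names (`train_naive_bayes`, `predict`, `loo_accuracy`) are hypothetical stand-ins chosen for illustration, and Laplace smoothing with `alpha=1.0` is an assumed default.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with Laplace smoothing (alpha)."""
    vocab = {w for d in docs for w in d.lower().split()}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())

    model = {"vocab": vocab, "log_prior": {}, "log_lik": {}}
    for c in class_docs:
        # Class prior: fraction of training documents carrying label c.
        model["log_prior"][c] = math.log(class_docs[c] / len(docs))
        # Smoothed per-class word likelihoods over the whole vocabulary.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model["log_lik"][c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return model


def predict(model, doc):
    """Return the class with the highest log-posterior score for doc."""
    words = [w for w in doc.lower().split() if w in model["vocab"]]
    scores = {
        c: prior + sum(model["log_lik"][c][w] for w in words)
        for c, prior in model["log_prior"].items()
    }
    return max(scores, key=scores.get)


def loo_accuracy(docs, labels):
    """Leave-one-out cross-validation: hold out each document in turn."""
    hits = 0
    for i in range(len(docs)):
        train_docs = docs[:i] + docs[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        model = train_naive_bayes(train_docs, train_labels)
        hits += predict(model, docs[i]) == labels[i]
    return hits / len(docs)


# Hypothetical toy corpus (NOT the labelled corpus described in the abstract).
docs = [
    "accelerating seismicity explained by critical point cascade triggering",
    "power law acceleration signals a critical state before rupture",
    "accelerating seismicity reflects stress loading of the main fault",
    "stress accumulation model with main fault loading signature",
]
labels = ["critical", "critical", "non-critical", "non-critical"]

model = train_naive_bayes(docs, labels)
print(predict(model, "cascade triggering near a critical point"))
print(predict(model, "loading of the main fault"))
print(loo_accuracy(docs, labels))
```

On a corpus this small, the cross-validation score changes considerably when a single document is added or removed, which is the same small-sample variance the abstract cites as a partial explanation for weak generalization.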