Objectively Determining the Number of Similar Hydrographic Clusters with Unsupervised Machine Learning

Carola Trahms, Arne Biastoch, Yannick Wölker

doi:10.5194/egusphere-egu23-11687

Publication Date: May 15, 2023

Abstract

Determining the number of existing water masses and defining their boundaries is subject to ongoing discussion in physical oceanography. Traditionally, water masses are defined manually by experts, who set constraints based on experience and previous knowledge about the hydrographic properties describing them. In recent years, clustering, an unsupervised machine learning approach, has been introduced as a tool to determine clusters, i.e., volumes with similar hydrographic properties, without explicitly defining their hydrographic constraints. However, the exact number of clusters to look for has so far been set manually by an expert.

We propose a method that determines a fitting number of hydrographic clusters in a data-driven way. In a first step, the method averages the data in different-sized slices along the time or depth axis, as the structure of the hydrographic space changes strongly in either time or depth. Then the method applies clustering algorithms to the averaged data and calculates off-the-shelf evaluation scores (Davies-Bouldin, Calinski-Harabasz, Silhouette Coefficient) for several predefined numbers of clusters. In the last step, the optimal number of clusters is determined by analyzing the cluster evaluation scores across different numbers of clusters for optima or relevant changes in trend.

For validation, we applied this method to the output of the high-resolution Atlantic Ocean model VIKING20X for the subpolar North Atlantic between 1993 and 1997, in direct exchange with domain experts to discuss the resulting clusters. Due to the change from strong to weak deep convection in these years, the hydrographic properties vary strongly in the time and depth dimensions, providing a specific challenge to our methodology.

Our findings suggest that it is possible to identify an optimal number of clusters using the off-the-shelf cluster evaluation scores that catch the underlying structure of the hydrographic space. The optimal number of clusters identified by our data-driven method agrees with the optimal number of clusters found by expert interviews. These findings contribute to aiding and objectifying water mass definitions across multiple expert decisions, and demonstrate the benefit of introducing data science methods to analyses in physical oceanography.
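
The three steps described in the abstract (slice averaging, clustering with evaluation scores, and comparison across cluster counts) could be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not name the clustering algorithm or the software used, so k-means and scikit-learn are assumptions, the helper names `average_in_slices`, `score_cluster_counts`, and `best_k_per_score` are hypothetical, and the temperature and salinity values are synthetic toy data rather than VIKING20X output. Only the search for optima is shown; the analysis of changes in trend is left out.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler


def average_in_slices(data, slice_size, axis=0):
    """Average `data` over consecutive, non-overlapping slices along `axis`."""
    data = np.moveaxis(np.asarray(data, dtype=float), axis, 0)
    n_slices = data.shape[0] // slice_size  # an incomplete trailing slice is dropped
    trimmed = data[: n_slices * slice_size]
    grouped = trimmed.reshape(n_slices, slice_size, *data.shape[1:])
    return np.moveaxis(grouped.mean(axis=1), 0, axis)


def score_cluster_counts(features, candidate_ks, seed=0):
    """Cluster `features` (n_samples, n_features) once per k and score each result."""
    # Standardize so that no single property dominates the distance measure.
    scaled = StandardScaler().fit_transform(features)
    scores = {}
    for k in candidate_ks:
        # k-means is an assumption; the abstract only says "clustering algorithms".
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(scaled)
        scores[k] = {
            "davies_bouldin": davies_bouldin_score(scaled, labels),
            "calinski_harabasz": calinski_harabasz_score(scaled, labels),
            "silhouette": silhouette_score(scaled, labels),
        }
    return scores


def best_k_per_score(scores):
    """Return the k at each score's optimum (Davies-Bouldin: lower is better)."""
    return {
        "davies_bouldin": min(scores, key=lambda k: scores[k]["davies_bouldin"]),
        "calinski_harabasz": max(scores, key=lambda k: scores[k]["calinski_harabasz"]),
        "silhouette": max(scores, key=lambda k: scores[k]["silhouette"]),
    }


# Synthetic toy data (not model output): 12 time steps, 300 grid points,
# 2 properties (temperature, salinity), drawn around 3 artificial centers.
rng = np.random.default_rng(0)
centers = np.array([[2.0, 34.9], [6.0, 35.1], [10.0, 35.3]])
membership = np.repeat(np.arange(3), 100)
noise = rng.normal(scale=[0.2, 0.01], size=(12, 300, 2))
toy = centers[membership] + noise

# Step 1: average along the time axis in slices of 4 time steps.
averaged = average_in_slices(toy, slice_size=4, axis=0)

# Steps 2 and 3: score several cluster counts on the first averaged slice.
scores = score_cluster_counts(averaged[0], candidate_ks=range(2, 7))
best = best_k_per_score(scores)
print(best)
```

In practice, the same scoring would be repeated for each averaged slice and each slice size, and the resulting score curves would be inspected together rather than reduced to a single optimum per score.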
