Using t-distributed Stochastic Neighbor Embedding (t-SNE) for cluster analysis and spatial zone delineation of groundwater geochemistry data

Honghua Liu, Jing Yang, Ming Ye, Scott C James, Zhonghua Tang, Jie Dong, Tongju Xing

doi:10.1016/j.jhydrol.2021.126146

Abstract

Cluster analysis is a valuable tool for understanding spatial and temporal patterns (e.g., spatial zones) of groundwater geochemistry. Because the number of clusters and the cluster memberships are unknown in real-world problems, a number of methods have been used to assist cluster analysis, among which graphic approaches are popular and intuitive. This study introduced, for the first time, the t-distributed Stochastic Neighbor Embedding (t-SNE) method as a graphic approach to assist cluster analysis of groundwater geochemistry data. Hierarchical cluster analysis (HCA) was applied to the original groundwater geochemistry data, and t-SNE was used to help determine the number of clusters and the cluster memberships. Afterward, t-SNE was used to help delineate spatial zones of groundwater geochemistry. The t-SNE-based cluster visualization was compared to the visualization based on principal component analysis (PCA). By applying HCA, PCA, and t-SNE to three geochemical datasets (the Oslo transect, Taiyuan karst water, and Jianghan Plain groundwater datasets, which are characterized by different numbers of samples and features collected across different space and time scales), we found that t-SNE outperformed PCA in assisting HCA and is a promising tool for helping determine the number of HCA clusters and delineate spatial zones of groundwater geochemistry. It should be noted that t-SNE alone cannot be used for cluster analysis, partly because t-SNE visualization depends on a hyperparameter called perplexity whose appropriate value is unknown a priori for real-world problems. The perplexity values used in this study were determined empirically, and a small value of 0.1 was used for the Taiyuan karst water dataset with 14 samples. For the other two datasets with hundreds of samples, the corresponding perplexity values were 20 and 30, within the range of 5 to 50 commonly used in t-SNE.
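To make the role of the perplexity hyperparameter more concrete, the following is a minimal Python sketch of how perplexity is defined in the standard t-SNE formulation: for each sample, Gaussian similarities to the other samples are normalized into a conditional distribution, and the perplexity is 2 raised to the Shannon entropy of that distribution, which can be read as an effective number of neighbors. The bandwidth is then tuned so that the perplexity matches the user-specified value. This is an illustration only, not the implementation used in the study; the function names, the bisection bounds, and the toy squared distances are all assumptions introduced here.

```python
import math


def conditional_probs(sq_dists, sigma):
    """Gaussian similarities p_{j|i} from one sample i to its neighbors.

    sq_dists holds squared distances from sample i to every other sample.
    """
    # Shift by the smallest distance for numerical stability;
    # the shift cancels when the weights are normalized.
    d_min = min(sq_dists)
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(probs):
    """Perplexity 2**H, where H is the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy


def sigma_for_perplexity(sq_dists, target, tol=1e-5, max_iter=200):
    """Bisect on the bandwidth sigma until the perplexity matches target.

    The search bounds are arbitrary assumptions chosen for this sketch.
    """
    lo, hi = 1e-10, 1e10
    sigma = 1.0
    for _ in range(max_iter):
        sigma = math.sqrt(lo * hi)  # geometric midpoint of the bracket
        perp = perplexity(conditional_probs(sq_dists, sigma))
        if abs(perp - target) < tol:
            break
        if perp > target:
            hi = sigma  # distribution too flat: shrink the bandwidth
        else:
            lo = sigma  # distribution too peaked: widen the bandwidth
    return sigma


# Hypothetical squared distances from one sample to 13 other samples
# (a 14-sample dataset); these numbers are made up for illustration.
sq_dists = [float(k) for k in range(1, 14)]
sigma = sigma_for_perplexity(sq_dists, target=5.0)
achieved = perplexity(conditional_probs(sq_dists, sigma))
```

In this sketch a larger target perplexity yields a wider bandwidth, so each sample treats more of the dataset as its neighborhood, which is one reason the resulting t-SNE picture can change with the chosen value.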

Full Text