An Enhanced K-Means Algorithm for Water Quality Analysis of The Haihe River in China

Hui Zou, Xiaojing Wang, Zhihong Zou

doi:10.3390/ijerph121114400

Abstract

Growing data volume and complexity, driven by an uncertain environment, are today's reality. To identify water quality effectively and reliably, this paper presents a modified fast clustering algorithm for water quality analysis. The algorithm adopts a varying-weights K-means clustering algorithm to analyze water monitoring data. The varying-weights scheme is the best indicator weighting selected by a modified indicator weight self-adjustment algorithm based on K-means, named MIWAS-K-means. The new clustering algorithm avoids cases in which the margin of the iteration cannot be calculated. With this fast clustering analysis, the quality of water samples can be identified. The algorithm is applied to water quality data from the Haihe River (China), collected by the monitoring network over eight years (2006–2013) with four indicators at seven sites (2078 samples). Both the theoretical and simulated results demonstrate that the algorithm is efficient and reliable for water quality analysis of the Haihe River. In addition, the algorithm can be applied to more complex, high-dimensional data matrices.

Highlights

The evaluation of water quality is essentially a classification problem [1].
Various cluster validity measures can be used to evaluate the performance of a clustering algorithm [27].
The sum of squared within-cluster errors (SSE) is especially important because real-world clustering applications seldom reveal the class labels of the data.
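
For reference, a common textbook form of the SSE is given here. The symbols are assumptions introduced for illustration and do not come from the paper, whose weighted variant may differ: K is the number of clusters, C_k is the set of samples assigned to cluster k, and \mu_k is the centroid of C_k.

```latex
\mathrm{SSE} = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^{2},
\qquad
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x \in C_k} x
```

A smaller SSE indicates more compact clusters, and the measure needs no class labels because it depends only on the samples and their assigned centroids.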

Summary

Introduction

Because current water quality assessment standards are not uniform, research on unsupervised methods is quite active. There are two common methods of unsupervised classification: cluster analysis (CA), especially hierarchical cluster analysis (HCA), and principal component analysis (PCA). These methods have been widely used in water quality management [2,3,4,5,6], but as data from the water environment grow in volume and complexity, water quality evaluation with these methods is strained by the data-handling load. In K-means clustering, the Euclidean distance with equal weights is widely used [8,9,10]. Weights have also been calculated by the superscale, which is the ratio of the value of every indicator at each monitoring point to the corresponding water quality standard [12,13].
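
The two weighting ideas above can be illustrated with a minimal sketch. It is not the paper's MIWAS-K-means procedure. The function names, the sample values, and the normalization of the superscale ratios to sum to one are assumptions made for illustration, and the sketch further assumes that a higher value of every indicator means worse quality.

```python
import math


def superscale_weights(sample, standards):
    """Ratio of each indicator value to its standard, normalized to sum to 1.

    The normalization step is an assumption; the source only defines the ratio.
    """
    ratios = [value / limit for value, limit in zip(sample, standards)]
    total = sum(ratios)
    return [ratio / total for ratio in ratios]


def weighted_euclidean(x, center, weights):
    """Euclidean distance with one weight per indicator.

    With all weights equal to 1 this reduces to the ordinary Euclidean distance.
    """
    return math.sqrt(
        sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, center))
    )


# Hypothetical readings for two indicators and their hypothetical standards.
sample = [2.0, 3.0]
standards = [1.0, 2.0]
weights = superscale_weights(sample, standards)

# Equal weights recover the plain Euclidean distance of a 3-4-5 triangle.
equal = weighted_euclidean([0.0, 0.0], [3.0, 4.0], [1.0, 1.0])
```

With superscale weights, an indicator that exceeds its standard by a larger factor contributes more to the distance, whereas equal weights treat all indicators alike.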

Methods

Results

Conclusion