Machine learning, deep learning, and water quality data have been used in recent years to predict outbreaks of harmful algae, especially Microcystis, and to analyze their causes. However, for various reasons, water quality data are often High-Dimension, Low-Sample-Size (HDLSS), meaning that the number of samples is smaller than the number of dimensions. Moreover, class imbalance may arise because Microcystis occurrences are unevenly distributed. These problems make it difficult to predict the occurrence of Microcystis and analyze its causes with machine learning. In this study, a machine learning model that applies Feature Engineering (FE) and Feature Selection (FS) algorithms is used to predict outbreaks of Microcystis and analyze the outbreak factors from imbalanced HDLSS water quality data. Prediction performance was verified as a binary classification of whether Microcystis would occur in the future, by applying three machine learning models to four data patterns. The causes of Microcystis occurrence were analyzed by visualizing the results of applying FE and FS. On the test data, the FE and FS methods predicted significantly better than the conventional method, with accuracy 0.108 points higher and F-value 0.691 points higher. Prediction performance also increased with smaller model capacity. Data-driven analysis suggested that total nitrogen, chemical oxygen demand, chlorophyll-a, dissolved oxygen saturation, and water temperature are associated with Microcystis occurrences. The results also indicated that basic statistics of the water quality distribution over a year (especially mean, standard deviation, and skewness), rather than the concentrations of water components, are related to the occurrence of Microcystis. These findings have not been reported in previous studies and are expected to contribute significantly to future studies of algae.
This study provides a method for analyzing water quality data that are high-dimensional with small sample sizes, imbalanced, or both.
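As an illustration of the kind of feature engineering described above, the following minimal sketch collapses one year of measurements per water quality variable into its mean, standard deviation, and skewness. It is not the study's implementation: the function name `yearly_summary_features`, the use of population (rather than sample) statistics, and the example readings are all assumptions made for illustration.

```python
from statistics import fmean, pstdev


def yearly_summary_features(series_by_variable):
    """Summarize each variable's yearly readings as mean, std, and skewness.

    `series_by_variable` maps a variable name to a list of readings
    taken over one year. This helper is hypothetical, not from the study.
    """
    features = {}
    for name, values in series_by_variable.items():
        mean = fmean(values)
        std = pstdev(values)  # population standard deviation (an assumption)
        if std > 0:
            # Population skewness: third central moment divided by std cubed.
            skew = fmean([(v - mean) ** 3 for v in values]) / std ** 3
        else:
            skew = 0.0  # a constant series has no asymmetry
        features[f"{name}_mean"] = mean
        features[f"{name}_std"] = std
        features[f"{name}_skew"] = skew
    return features


# Illustrative readings for one site-year; these values are made up.
example_year = {
    "total_nitrogen": [1.0, 2.0, 3.0, 4.0, 5.0],
    "water_temperature": [5.0, 5.0, 5.0, 5.0, 25.0],
}
features = yearly_summary_features(example_year)
```

Each site-year then becomes a single row of summary features, which could be passed to a feature selection step and a binary classifier. A symmetric series yields zero skewness, while a series with one unusually high reading yields positive skewness.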