Abstract

Research on soft-clustering has not been explored much compared to hard-clustering. Soft-clustering algorithms are important in solving complex clustering problems. One of the soft-clustering methods is the Gaussian Mixture Model (GMM). GMM is a clustering method to classify data points into different clusters based on the Gaussian distribution. This study aims to determine the number of clusters formed by using the GMM method. The data used in this study is synthetic data on water quality indicators obtained from the Kaggle website. The stages of the GMM method are: imputing the Not Available (NA) value (if there is an NA value), checking the data distribution, conducting a normality test, and standardizing the data. The next step is to estimate the parameters with the Expectation Maximization (EM) algorithm. The best number of clusters is based on the biggest value of the Bayesian Information Creation (BIC). The results showed that the best number of clusters from synthetic data on water quality indicators was 3 clusters. Cluster 1 consisted of 1110 observations with low-quality category, cluster 2 consisted of 499 observations with medium quality category, and cluster 3 consisted of 1667 observations with high-quality category or acceptable. The results of this study recommend that the GMM method can be grouped correctly when the variables used are generally normally distributed. This method can be applied to real data, both in which the variables are normally distributed or which have a mixture of Gaussian and non-Gaussian.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call