Problems that involve selecting samples from groups that are not known in advance arise in areas such as health, statistics, economics, and computer science. For instance, well-known strategies for selecting a sample from a population include simple random sampling and stratified random sampling. A related problem is selecting the initial points (seeds) for the K-means clustering algorithm. Many studies propose different strategies for choosing these seeds, but there is no consensus on the best or most effective approach, even for specific datasets or domains. In this work, we present a new strategy called the Sample of Groups (SOG) Algorithm, which combines concepts from grid, density, and maximum distance clustering algorithms to identify representative points, or samples, located near the center of mass of each cluster. To achieve this, we partition the data into boxes of an appropriate size and select the representatives of the most relevant boxes. The main goal of this work is thus to find quality samples, or seeds, that represent different clusters. To compare our approach with other algorithms, we use not only indirect measures related to K-means but also two direct measures that allow a fairer comparison among these strategies. The results indicate that our proposal outperforms the most commonly used algorithms.
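As a rough illustration of how grid, density, and maximum distance ideas can be combined to pick seeds, the following minimal sketch partitions the data into equal-width boxes, uses the point count of each box as its density, and then picks box means that are both dense and far from the seeds already chosen. It is not the SOG Algorithm itself: the function name `grid_density_seeds`, the fixed `bins` parameter, and the score `distance * count` are assumptions made only for this example.

```python
import numpy as np

def grid_density_seeds(X, k, bins=8):
    """Hypothetical sketch: choose up to k seeds from dense, well-separated grid boxes."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero on flat axes

    # Grid step: map every point to the integer index of its box.
    cells = np.minimum(((X - lo) / span * bins).astype(int), bins - 1)

    # Density step: count the points in each occupied box and compute the box mean.
    uniq, inverse, counts = np.unique(
        cells, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.ravel()  # keep a 1-D index across NumPy versions
    centers = np.zeros((len(uniq), X.shape[1]))
    np.add.at(centers, inverse, X)
    centers /= counts[:, None]

    # Maximum distance step: start from the densest box, then repeatedly take
    # the box whose (distance to nearest chosen seed) * (point count) is largest.
    chosen = [int(np.argmax(counts))]
    dist = np.linalg.norm(centers - centers[chosen[0]], axis=1)
    while len(chosen) < min(k, len(uniq)):
        nxt = int(np.argmax(dist * counts))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(centers - centers[nxt], axis=1))
    return centers[chosen]
```

The returned array could be passed as initial centers to a K-means implementation. The box size here is fixed by `bins`, whereas the abstract describes choosing boxes "with the right size", so any real use would need a principled rule for that choice.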