Abstract

Non-hierarchical clustering procedures usually require the user to specify the number of clusters before any clustering is performed. Deciding how many clusters suitably fit a dataset, and evaluating the resulting clustering, remain subjects of active research. We therefore examine three methods for determining the optimal number of clusters for a given dataset: the elbow method, the silhouette method, and the gap statistic. The dataset analysed was a table of 52 drugs known to act against the 5-HT receptor, described by their molecular weight, logP, number of heavy atoms, H-bond donors (HBD), H-bond acceptors (HBA), polar surface area (PSA), number of freely rotatable bonds (RB), and half-life. Before estimating the optimal number of clusters, the dataset was tested for clusterability using the Hopkins statistic. For the 5-HT receptor drug dataset, the Hopkins statistic was 0.2357, which, under the convention that values well below 0.5 indicate clustering tendency, suggests that the data are highly clusterable. The elbow, silhouette, and gap statistic methods did not reach a consensus on the optimal number of clusters. We therefore used the NbClust package, which computes 30 indices, to identify the optimal number of clusters, k, for the 5-HT receptor dataset; the largest number of indices supported a three-cluster solution.
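
As an illustration of the clusterability test mentioned above, the following is a minimal sketch of one common form of the Hopkins statistic, H = sum(W) / (sum(U) + sum(W)), under which values near 0 suggest clustered data and values near 0.5 suggest spatially random data. It is not the authors' implementation: the function names, the sample size, and the synthetic two-dimensional data are assumptions made for illustration, and some definitions raise the distances to the power of the data dimension, which this sketch omits.

```python
import math
import random


def nearest_distance(point, data, skip_index=None):
    """Euclidean distance from `point` to its nearest neighbour in `data`."""
    best = math.inf
    for i, other in enumerate(data):
        if i == skip_index:
            continue  # do not compare a real point with itself
        best = min(best, math.dist(point, other))
    return best


def hopkins(data, sample_size, rng):
    """Hopkins statistic H = sum(W) / (sum(U) + sum(W)).

    W: nearest-neighbour distances from sampled real points to other real points.
    U: nearest-neighbour distances from uniform random points to the real data.
    Under this convention, H near 0 suggests clustered data and H near 0.5
    suggests spatially random data.
    """
    dims = len(data[0])
    lows = [min(row[j] for row in data) for j in range(dims)]
    highs = [max(row[j] for row in data) for j in range(dims)]

    sampled = rng.sample(range(len(data)), sample_size)
    w = [nearest_distance(data[i], data, skip_index=i) for i in sampled]

    u = []
    for _ in range(sample_size):
        # Artificial point drawn uniformly from the bounding box of the data.
        fake = [rng.uniform(lows[j], highs[j]) for j in range(dims)]
        u.append(nearest_distance(fake, data))

    return sum(w) / (sum(u) + sum(w))


rng = random.Random(0)
# Synthetic stand-in data (NOT the 5-HT drug table): three tight 2-D blobs.
centers = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
clustered = [
    (cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
    for cx, cy in centers
    for _ in range(30)
]
# Comparison data with no cluster structure.
uniform = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(90)]

h_clustered = hopkins(clustered, sample_size=20, rng=rng)
h_uniform = hopkins(uniform, sample_size=20, rng=rng)
print(f"clustered: {h_clustered:.3f}, uniform: {h_uniform:.3f}")
```

On real data with features on different scales, such as molecular weight and logP, the columns would normally be standardised before computing distances. Some software reports the complementary value, 1 - H, so the direction of the convention should be checked before interpreting a reported figure.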
