Large, high-dimensional datasets are being generated at a rapid pace, and they are expected to support data-driven solutions to pressing challenges such as cybersecurity, credit card fraud, and network intrusion detection. The visual assessment of clustering tendency (VAT) family of algorithms is a useful tool for assessing the clustering tendency of such datasets and for subsequently clustering them, extracting groups and anomalies that help explain the data-generating phenomenon. However, the VAT and improved VAT (iVAT) algorithms have O(n²) time and space complexity because they use Prim's algorithm to build the Euclidean minimum spanning tree (EMST) of the data points, which makes them inapplicable to large datasets. This paper develops three novel time- and memory-scalable algorithms for fast computation of the VAT EMST by reducing the search space for the next candidate EMST edge and by using an efficient data structure, the k-d tree. Next, we develop a novel memory-preserving approach that computes the iVAT reordered dissimilarity image (RDI) from the previously calculated EMST edge lengths, without having to compute the memory-intensive n×n VAT RDI. Experiments on several synthetic and real-life datasets show that the proposed approaches can quickly assess clustering tendency in a memory-saving manner. We also demonstrate the applicability of the proposed methods to anomaly detection in large volumes of high-dimensional data.
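To make the quadratic cost concrete, the following is a minimal sketch of the baseline VAT ordering that the abstract describes as the bottleneck: Prim's algorithm run over a dense n×n Euclidean dissimilarity matrix. It is not one of the paper's three proposed algorithms, and the function name `vat_order` and its return values are illustrative assumptions rather than the authors' code.

```python
import math


def vat_order(points):
    """Baseline VAT ordering (hypothetical helper, not the paper's method).

    Returns the Prim visiting order and the length of the spanning-tree
    edge that attached each point after the first.
    """
    n = len(points)
    # Dense n x n dissimilarity matrix: this is the O(n^2) memory cost.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # VAT conventionally starts from an endpoint of the largest dissimilarity.
    start = max(range(n), key=lambda i: max(dist[i]))

    order = [start]
    edge_lengths = []
    in_tree = [False] * n
    in_tree[start] = True
    best = dist[start][:]  # best[j] = distance from j to the current tree

    for _ in range(n - 1):
        # Scanning all remaining points each round gives O(n^2) time overall.
        nxt = min((j for j in range(n) if not in_tree[j]), key=lambda j: best[j])
        order.append(nxt)
        edge_lengths.append(best[nxt])
        in_tree[nxt] = True
        for j in range(n):
            if not in_tree[j] and dist[nxt][j] < best[j]:
                best[j] = dist[nxt][j]
    return order, edge_lengths


if __name__ == "__main__":
    # Two well-separated pairs of points (toy data, assumed for illustration).
    order, edges = vat_order([(0, 0), (0, 1), (10, 0), (10, 1)])
    print(order, edges)
```

In the resulting sequence, an unusually long edge marks the jump from one group of points to another, which is why the EMST edge lengths alone carry much of the cluster-tendency information. The scalable variants in the paper avoid materialising `dist` at all; this sketch keeps it only to show where the cost arises.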