Abstract

Large, high-dimensional datasets are being generated at a rapid pace. They are expected to provide data-driven solutions to pressing challenges such as cybersecurity, credit card fraud, and network intrusion detection. The visual assessment of clustering tendency (VAT) family of algorithms is a useful tool for assessing the clustering tendency of such datasets and then clustering them, extracting groups and anomalies that help explain the data-generating phenomenon. However, the VAT and improved VAT (iVAT) algorithms have O(n²) time and space complexity because they use Prim's algorithm to build the Euclidean minimum spanning tree (EMST) of the data points, which makes them inapplicable to large datasets. This paper develops three time- and memory-scalable algorithms for fast computation of the VAT EMST by reducing the search space for the next possible EMST edge and by using an efficient data structure, the k-d tree. Next, we develop an approach that computes the iVAT reordered dissimilarity image (RDI) in a memory-preserving manner from the previously calculated EMST edge lengths, without having to compute the memory-intensive n×n VAT RDI. We experiment on several synthetic and real-life datasets to show that the proposed approaches can quickly assess clustering tendency in a memory-saving manner. We also demonstrate the applicability of the proposed methods to anomaly detection in large volumes of high-dimensional data.
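To make the O(n²) bottleneck concrete, the following is a minimal sketch of the baseline VAT reordering with a dense Prim's algorithm, as it is commonly described. It is not one of the scalable algorithms proposed in this paper. The function name `vat_order` and the sample points are hypothetical and chosen only for illustration.

```python
import math


def vat_order(points):
    """Return (order, edge_lengths) for a baseline VAT reordering.

    Illustrative sketch only: dense Prim's algorithm, O(n^2) time,
    with the full n x n dissimilarity matrix held in memory.
    """
    n = len(points)
    # Full pairwise Euclidean dissimilarity matrix (the memory bottleneck).
    dist = [[math.dist(p, q) for q in points] for p in points]

    # VAT starts from one endpoint of the largest pairwise dissimilarity.
    start = max(range(n), key=lambda i: max(dist[i]))

    in_tree = [False] * n
    in_tree[start] = True
    best = dist[start][:]  # best[j]: shortest distance from j to the tree
    order = [start]        # VAT ordering of the point indices
    edge_lengths = []      # EMST edge lengths, in insertion order

    for _ in range(n - 1):
        # Pick the point outside the tree that is closest to the tree.
        nxt = min((j for j in range(n) if not in_tree[j]),
                  key=lambda j: best[j])
        in_tree[nxt] = True
        order.append(nxt)
        edge_lengths.append(best[nxt])
        # Relax distances to the tree through the newly added point.
        for j in range(n):
            if not in_tree[j] and dist[nxt][j] < best[j]:
                best[j] = dist[nxt][j]
    return order, edge_lengths


# Hypothetical toy data: two well-separated groups.
sample_points = [(0, 0), (0, 1), (10, 10), (10, 11), (0, 2)]
sample_order, sample_edges = vat_order(sample_points)
```

In this toy example the single long edge in `sample_edges` marks the jump between the two groups, which is the kind of structure a VAT image makes visible. Every step scans all remaining points, which is the search that the abstract says the proposed algorithms narrow with a k-d tree.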
