Abstract
Anomaly detection in high-dimensional data has become a fundamental research problem with a wide range of real-world applications. However, many existing anomaly detection techniques fail to retain sufficient accuracy when faced with so-called "big data," characterised by high-volume, high-velocity data generated by a variety of sources. The combination of these two problems can be referred to as the "curse of big dimensionality," which affects existing techniques in terms of both performance and accuracy. To address this gap and to understand the core problem, it is necessary to identify the unique challenges that arise when anomaly detection must cope with high dimensionality and big data simultaneously. Hence, this survey documents the state of anomaly detection in high-dimensional big data by representing these unique challenges with a triangular model whose vertices are: the problem (big dimensionality), the techniques/algorithms (anomaly detection), and the tools (big data applications/frameworks). Works that fall directly into any of these vertices, or are closely related to them, are considered for review. Furthermore, the limitations of traditional approaches and current strategies for high-dimensional data are discussed, along with recent techniques and applications on big data required for the optimization of anomaly detection.
Highlights
Many data sets continuously stream from weblogs, financial transactions, health records, and surveillance logs, as well as from business, telecommunication, and biosciences [1, 2]
With the world becoming increasingly data-driven and with no generic approach for big data anomaly detection, the problem of high dimensionality is inevitable in many application areas
Identifying anomalous data points across large data sets with high-dimensionality issues is a research challenge. This survey has provided a comprehensive overview of anomaly detection techniques related to the big data features of volume and velocity, and has examined strategies for addressing the problem of high dimensionality
Summary
Many data sets continuously stream from weblogs, financial transactions, health records, and surveillance logs, as well as from business, telecommunication, and biosciences [1, 2]. One algorithm, Outlier Detection for Mixed-Attribute Data sets (ODMAD), proposed by Koufakou and Georgiopoulos [61], first detects anomalies based on their categorical attributes and then focuses on subsets of the data in the continuous space, using information about those subsets derived from the categorical attributes. Their results demonstrated that the distributed version of ODMAD achieves close to linear speedup with the number of compute nodes used. The most common strategies for addressing the problem of high dimensionality are to locate the most important dimensions (i.e., a subset of features), known as the variable selection method, or to combine dimensions into a smaller set of new variables, known as the dimensionality reduction method [73]. These strategies become highly complex and ineffective for detecting anomalies because of the challenges associated with large data sets and data streams. The authors compared the D-cube technique against state-of-the-art methods and showed that D-cube is both memory-efficient and accurate.
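To make the two-phase idea described for ODMAD more concrete, the following minimal Python sketch scores records by the rarity of their categorical values and small value combinations, corresponding only to the first, categorical phase. It is an illustrative simplification under assumed inputs (a list of attribute dictionaries) and is not the published ODMAD scoring function.

```python
# Minimal sketch of the first-phase idea described for ODMAD: flag records whose
# categorical values (or small value combinations) are rare in the data set.
# Illustrative simplification only; not the published ODMAD scoring function.
from collections import Counter
from itertools import combinations

def categorical_rarity_scores(records, max_combo=2):
    """Score each record by how infrequent its categorical values and
    small value combinations are; higher score = more anomalous."""
    counts = Counter()
    for rec in records:
        items = list(rec.items())
        for r in range(1, max_combo + 1):
            for combo in combinations(items, r):
                counts[combo] += 1
    n = len(records)
    scores = []
    for rec in records:
        items = list(rec.items())
        score = 0.0
        for r in range(1, max_combo + 1):
            for combo in combinations(items, r):
                score += 1.0 / (counts[combo] / n)  # rare combinations add large terms
        scores.append(score)
    return scores

if __name__ == "__main__":
    # 50 ordinary records plus one with rare categorical values (hypothetical data).
    data = [{"proto": "tcp", "service": "http"}] * 50 + [{"proto": "icmp", "service": "telnet"}]
    scores = categorical_rarity_scores(data)
    print("most anomalous record index:", scores.index(max(scores)))  # expected: 50
```

Similarly, the dimensionality reduction strategy mentioned above can be illustrated with a short sketch that scores anomalies by PCA reconstruction error; the component count, threshold, and synthetic data are assumptions made for this example and do not represent any particular method surveyed here.

```python
# Minimal sketch: anomaly scoring via PCA reconstruction error, illustrating the
# generic "combine dimensions into a smaller set of new variables" strategy.
import numpy as np
from sklearn.decomposition import PCA

def pca_anomaly_scores(X, n_components=5):
    """Project data onto a low-dimensional subspace and score each point
    by how poorly it is reconstructed from that subspace."""
    pca = PCA(n_components=n_components)
    X_reduced = pca.fit_transform(X)                 # reduced representation
    X_reconstructed = pca.inverse_transform(X_reduced)
    return np.linalg.norm(X - X_reconstructed, axis=1)  # per-point reconstruction error

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 50))   # high-dimensional "normal" data (synthetic)
    X[:5] += 10                       # inject a few obvious anomalies
    scores = pca_anomaly_scores(X)
    threshold = np.percentile(scores, 99)  # illustrative cutoff
    print("flagged indices:", np.where(scores > threshold)[0])
```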
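Both sketches assume the data fits in memory on a single machine; the survey's point is precisely that such strategies become costly or ineffective once volume and velocity grow, which motivates distributed variants such as the distributed ODMAD implementation discussed above.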