Abstract

Anomaly detection in high-dimensional data is a critical research issue with serious implications for real-world problems. Many issues in this field remain unsolved, and several modern anomaly detection methods struggle to maintain adequate accuracy due to the highly descriptive nature of big data. This phenomenon, referred to as the "curse of dimensionality", degrades both the accuracy and the performance of traditional techniques. This research therefore proposes a hybrid model based on a Deep Autoencoder Neural Network (DANN) with five layers, trained to minimize the difference between its input and output. The proposed model was applied to a real-world gas turbine (GT) dataset that contains 87620 columns and 56 rows. During the experiments, two issues were investigated and resolved to improve the results. The first is the dataset's class imbalance, which was addressed with the SMOTE technique. The second is poor training performance, which was addressed by selecting a suitable optimization algorithm. Several optimizers were investigated and tested, including stochastic gradient descent (SGD), RMSprop, Adam, and Adamax; among them, Adamax gave the best results when used to train the DANN model. The experimental results show that the proposed model can detect anomalies by efficiently reducing the high dimensionality of the dataset, achieving an accuracy of 99.40%, an F1-score of 0.9649, an Area Under the Curve (AUC) of 0.9649, and a minimal loss function during training of the hybrid model.
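As a concrete illustration of the pipeline described above, the Python sketch below combines SMOTE class balancing, a five-layer dense autoencoder trained with the Adamax optimizer to minimize the input/output reconstruction error, and a reconstruction-error threshold for flagging anomalies. The layer widths, learning rate, epoch count, and 95th-percentile cut-off are illustrative assumptions, not the exact configuration reported in the paper.

    # Hedged sketch: SMOTE balancing + five-layer autoencoder (Adamax) +
    # reconstruction-error scoring. Hyperparameters are assumptions.
    import numpy as np
    from imblearn.over_sampling import SMOTE
    from tensorflow.keras import layers, models, optimizers

    def build_autoencoder(n_features):
        # Five dense layers: encoder (32 -> 16), bottleneck (8), decoder (16 -> n_features).
        return models.Sequential([
            layers.Input(shape=(n_features,)),
            layers.Dense(32, activation="relu"),
            layers.Dense(16, activation="relu"),
            layers.Dense(8, activation="relu"),            # compressed representation
            layers.Dense(16, activation="relu"),
            layers.Dense(n_features, activation="linear"),
        ])

    def detect_anomalies(X, y, epochs=50, batch_size=64):
        # 1) Balance the minority (anomaly) class with SMOTE before training.
        X_bal, _ = SMOTE(random_state=42).fit_resample(X, y)

        # 2) Train the autoencoder with Adamax to minimize the reconstruction MSE.
        model = build_autoencoder(X.shape[1])
        model.compile(optimizer=optimizers.Adamax(learning_rate=1e-3), loss="mse")
        model.fit(X_bal, X_bal, epochs=epochs, batch_size=batch_size,
                  validation_split=0.1, verbose=0)

        # 3) Score every original sample by its reconstruction error and flag
        #    the largest errors as anomalies (95th percentile is an assumed cut-off).
        errors = np.mean(np.square(X - model.predict(X, verbose=0)), axis=1)
        threshold = np.percentile(errors, 95)
        return errors, errors > threshold

In practice the threshold would be tuned against the labels to reproduce the reported accuracy, F1-score, and AUC; the sketch only shows the overall shape of the workflow.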

Highlights

  • Nowadays, a huge amount of data is produced periodically at an unparalleled speed from diverse and composite origins such as social media, sensors, telecommunications, financial transactions, etc. [1,2]

  • Hautamaki et al. [20] proposed Outlier Detection using Indegree Number (ODIN), which is based on the kNN graph and segregates data instances according to their influence within their neighborhoods

  • This study proposes an efficient and improved deep-autoencoder-based anomaly detection approach for a real industrial gas turbine dataset

Introduction

A huge amount of data is produced periodically at an unparalleled speed from diverse and composite origins such as social media, sensors, telecommunications, financial transactions, etc. [1,2]. Big data is conceptualized in terms of the 5 Vs (Value, Veracity, Variety, Velocity, and Volume) [5]. High dimensionality can hinder data analytics tasks such as anomaly detection in large datasets. In neighbor-based methods, the anomaly score is the average or weighted distance between a data object and its k nearest neighbors [19,21]. Hautamaki et al. [20] proposed Outlier Detection using Indegree Number (ODIN), which is based on the kNN graph and segregates data instances according to their influence within their neighborhoods. It is worth mentioning that all of the above neighbor-based detection methods are independent of the data distribution and can detect isolated entities. However, their success relies heavily on distance measures, which become unreliable or insignificant in high-dimensional spaces. The underlying assumption is that if two objects were created by the same process, they would most likely be nearest neighbors or share similar neighbors [37].
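As a hedged illustration of these neighbor-based scores, the sketch below computes the average distance to the k nearest neighbors [19,21] and an ODIN-style indegree on the kNN graph [20] using scikit-learn; the choice of k and the library are assumptions for demonstration, not details taken from the cited works.

    # Illustrative neighbor-based anomaly scores (k and scikit-learn are assumptions).
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def knn_distance_score(X, k=5):
        # Average distance from each point to its k nearest neighbors,
        # excluding the point itself; larger scores suggest anomalies.
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        dist, _ = nn.kneighbors(X)
        return dist[:, 1:].mean(axis=1)

    def odin_indegree_score(X, k=5):
        # ODIN-style score: indegree of each point in the directed kNN graph,
        # i.e. how many other points list it among their k nearest neighbors.
        # A low indegree (little influence in its neighborhood) indicates an outlier.
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        _, idx = nn.kneighbors(X)
        indegree = np.zeros(len(X), dtype=int)
        for neighbors in idx[:, 1:]:                  # column 0 is the point itself
            indegree[neighbors] += 1
        return indegree

Both scores degrade as dimensionality grows because pairwise distances lose contrast, which motivates the autoencoder-based dimensionality reduction proposed in this study.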
