A comparison of two hybrid ensemble techniques for network anomaly detection in Spark distributed environment

Gagandeep Kaur

doi:10.1016/j.jisa.2020.102601

Abstract

This paper compares ensemble methods in a Spark-supported distributed environment. With ever-changing attack trends, traditional machine learning algorithms fail to detect new types of network-based attacks, so machine learning techniques need to be improved. Secondly, there is a need for faster and more accurate detection algorithms, which calls for the study of distributed frameworks like Apache Spark. Thirdly, dataset size reduction plays a major role in machine learning algorithms, and effort is therefore required to reduce data sizes without affecting the performance metrics. In this work, KMeans clustering and GMM-based clustering have been used to reduce the dataset size while maintaining the diversity of the traffic. The clustered data acts as input to a Random Forest (RF) classifier. RF classification has also been done for class-wise detection of attacks. The outputs from KMeans-based RF classification, GMM-based classification and class-wise RF classification were taken as input for the base learners of the ensemble methods. Two ensemble methods, namely a Weighted Voting based AdaBoost ensemble and a Stacking based ensemble, have been studied and compared. Two datasets, NSL-KDD and UNSW-NB15, have been used to carry out the study. With KM+RF, accuracies of 78.9% and 58.54% were achieved for KDDTest+ and KDDTest-21 respectively. Accuracies of 79.98% and 63.19% were achieved with GMM+RF. Furthermore, an accuracy of 82% was achieved for UNSW-NB15 with KM+RF, whereas an accuracy of 84% was achieved for the same dataset with GMM+RF. With the Weighted Voting based AdaBoost ensemble, accuracies of 90.46% and 83.32% were achieved for KDDTest+ and KDDTest-21 respectively. Similarly, an accuracy of 91.31% was achieved for the UNSW-NB15 test data with the Weighted Voting based AdaBoost ensemble. With the Stacking based ensemble, accuracies of 85.24% and 78.20% were achieved for KDDTest+ and KDDTest-21 respectively. Lastly, an accuracy of 89.57% was achieved with the Stacking based ensemble for the UNSW-NB15 test dataset. Overall, we were able to achieve better detection rates and accuracies with reduced false alarm rates by using ensemble methods. Tests were conducted on different machines by varying the number of executor cores to study time latency in the distributed Spark environment.
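
To make the idea of weighted voting over base learners concrete, the following is a minimal sketch in plain Python, not the authors' Spark implementation. It assumes the classic binary AdaBoost weight, 0.5 * ln((1 - err) / err), computed from each base learner's validation error. The abstract does not state the exact weighting scheme, so this formula, the function names (learner_weight, weighted_vote), the error rates and the labels are all hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def learner_weight(error_rate, eps=1e-10):
    """AdaBoost-style weight (assumed scheme): lower error gives a larger weight."""
    # Clamp the error so the logarithm stays finite for perfect or useless learners.
    err = min(max(error_rate, eps), 1.0 - eps)
    return 0.5 * math.log((1.0 - err) / err)


def weighted_vote(predictions, weights):
    """Return the label with the highest total weight across base learners."""
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Hypothetical validation error rates for three base learners,
# e.g. KM+RF, GMM+RF and a class-wise RF (values are illustrative only).
validation_errors = [0.10, 0.30, 0.35]
weights = [learner_weight(e) for e in validation_errors]

# Hypothetical predictions of the three learners for one traffic record.
record_predictions = ["attack", "normal", "normal"]
final_label = weighted_vote(record_predictions, weights)
print(final_label)
```

In this made-up case the single most reliable learner outweighs the other two combined, so the weighted vote differs from a simple majority vote. A stacking ensemble would instead train a second-level model on the base learners' outputs rather than using fixed weights.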

Full Text