Improving Detection Accuracy for Imbalanced Network Intrusion Classification using Cluster-based Under-sampling with Random Forests

Md Ochiuddin Miah, Dewan Md Farid, Sakib Shahriar Khan, Swakkhar Shatabda

doi:10.1109/icasert.2019.8934495

Abstract

Network intrusion classification in the imbalanced big data environment has become a significant issue in information and communications technology (ICT) in this digital era. Intrusion detection systems (IDSs) are now commonly used tools for detecting and preventing internal and external network attacks/intrusions. IDSs are broadly divided into host-based and network-based systems; those that detect intrusions using pattern-matching techniques are known as misuse-based intrusion detection systems. Machine learning (ML) and data mining (DM) algorithms have been widely used for classifying intrusions in IDSs over the last few decades. One of the major challenges in building an IDS with machine learning and data mining algorithms is to improve intrusion classification accuracy while also reducing the false-positive rate. In this paper, we introduce a new method for improving the detection rate of minority-class network attacks/intrusions using cluster-based under-sampling with a Random Forest classifier. The proposed method is a multi-layer classification approach that can process highly imbalanced big data to correctly identify minority/rare-class intrusions. The proposed method first classifies an incoming data point as an attack/intrusion or not (i.e., normal behaviour); if it is an attack, the method then tries to classify the attack type and, after that, the sub-attack type. We use a cluster-based under-sampling technique to deal with the class-imbalance problem and the popular ensemble classifier Random Forest to address the overfitting problem. We use the KDD99 intrusion detection benchmark dataset for experimental analysis and compare the performance of the proposed method with existing machine learning algorithms: Artificial Neural Network (ANN), naive Bayes (NB) classifier, Random Forest, and Bagging.
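To make the idea of cluster-based under-sampling concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not name the clustering algorithm or the sampling ratio, so this sketch assumes plain k-means on the majority class and shrinks that class to roughly the size of the largest remaining class, drawing from each cluster in proportion to its size. The function names (`kmeans_assign`, `cluster_undersample`), the parameter `k`, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_assign(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for every point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return assign


def cluster_undersample(X, y, majority_label, k=3, seed=0):
    """Under-sample the majority class cluster by cluster (assumed scheme).

    Assumes at least one sample carries a label other than majority_label.
    """
    rng = random.Random(seed)
    majority = [x for x, lab in zip(X, y) if lab == majority_label]
    others = [(x, lab) for x, lab in zip(X, y) if lab != majority_label]
    # Assumption: aim for about as many majority samples as the largest
    # of the remaining classes.
    target = max(Counter(lab for _, lab in others).values())
    if len(majority) <= target:
        return list(X), list(y)
    k = min(k, len(majority))
    assign = kmeans_assign(majority, k, seed=seed)
    kept = []
    for c in range(k):
        members = [p for p, a in zip(majority, assign) if a == c]
        if not members:
            continue
        # Each cluster contributes in proportion to its size, at least one.
        quota = max(1, round(target * len(members) / len(majority)))
        kept.extend(rng.sample(members, min(quota, len(members))))
    X_bal = kept + [x for x, _ in others]
    y_bal = [majority_label] * len(kept) + [lab for _, lab in others]
    return X_bal, y_bal


# Toy data (hypothetical): 60 "normal" points in three blobs, 6 "attack" points.
gen = random.Random(1)
normal = [(gen.gauss(cx, 0.3), gen.gauss(cy, 0.3))
          for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(20)]
attack = [(gen.gauss(10, 0.3), gen.gauss(10, 0.3)) for _ in range(6)]
X = normal + attack
y = ["normal"] * len(normal) + ["attack"] * len(attack)

X_bal, y_bal = cluster_undersample(X, y, "normal", k=3)
print(Counter(y_bal))
```

Sampling per cluster rather than uniformly keeps every region of the majority class represented in the reduced set. The balanced `X_bal`, `y_bal` would then be passed to a Random Forest classifier (for example scikit-learn's `RandomForestClassifier`), and the same procedure could be repeated at each layer of the attack / attack-type / sub-attack-type cascade.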

Full Text