Overview and Exploratory Analyses of CICIDS 2017 Intrusion Detection Dataset

Akinyemi Oyelakin, Olufadi H.I, Ogundele T.S, Salau-Ibrahim T, Abdulrauf U.T, Ajiboye I.K, Ameen A.O, Adeniji I A, Muhammad-Thani S

doi:10.29207/joseit.v2i2.5411

Abstract

Intrusion detection systems (IDSs) are used to detect attacks on a network. Machine learning (ML) approaches have been widely used to build IDSs because ML-based models tend to be more accurate when built from a very large and representative dataset. One of the benchmark datasets recently used to build ML-based intrusion detection models is the CICIDS2017 dataset. The dataset is organized in eight groups and was collected from the dataset repository of the Canadian Institute for Cybersecurity. It is available in both PCAP and net flow formats. This study used the net flow records in the CICIDS2017 dataset, as they were found to contain newer attacks, to be very large, and to be useful for traffic analysis. Exploratory data analysis (EDA) techniques were used to reveal various characteristics of the dataset. The general objective is to provide more insight into the nature, structure, and issues of the dataset so as to identify the best ways to use it to achieve improved ML-based IDS models. Furthermore, some of the open problems that can arise from using the dataset in any ML-based IDS are highlighted, and possible solutions are briefly discussed. The EDA techniques revealed important relationships between the input variables and the target class. The study concludes that EDA can better inform decisions about future IDS research using the dataset.
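As an illustration of the kind of first-pass EDA described above, the following is a minimal sketch using only the Python standard library. It is not the authors' procedure. The function name `summarize_flows`, the column names (`Flow Duration`, `Flow Bytes/s`, `Label`), and the four sample rows are assumptions made up for the example; they are not real CICIDS2017 records. The sketch counts records per target class and flags missing or non-finite feature values per column.

```python
import csv
import io
import math
from collections import Counter


def summarize_flows(rows, label_key="Label"):
    """Return class counts, per-column counts of missing/non-finite
    values, and the share of each class among all records."""
    class_counts = Counter()
    bad_values = Counter()
    for row in rows:
        class_counts[row[label_key]] += 1
        for column, raw in row.items():
            if column == label_key:
                continue
            try:
                value = float(raw)
            except (TypeError, ValueError):
                bad_values[column] += 1  # empty or non-numeric cell
                continue
            if not math.isfinite(value):
                bad_values[column] += 1  # NaN or +/-Infinity
    total = sum(class_counts.values())
    shares = {label: count / total for label, count in class_counts.items()}
    return class_counts, dict(bad_values), shares


# Tiny made-up sample (hypothetical columns and values, for illustration only).
SAMPLE_CSV = """Flow Duration,Flow Bytes/s,Label
120,3500.5,BENIGN
0,Infinity,BENIGN
45,,DoS
300,880.0,BENIGN
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
class_counts, bad_values, shares = summarize_flows(rows)
print(class_counts, bad_values, shares)
```

On a real flow file, the same function could be fed from `csv.DictReader` over an opened CSV file; the class shares give a quick view of class imbalance, and the per-column counts point to features that may need cleaning before model training.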

Full Text