Abstract

This paper reviews three datasets, KDD Cup '99, NSL-KDD and Kyoto 2006+, which are widely used in research on intrusion detection in computer networks. The KDD Cup '99 dataset consists of approximately five million records, each containing 41 features with which malicious attacks can be classified into four classes: Probe, DoS, U2R and R2L. The KDD Cup '99 dataset cannot reflect real traffic data, since it was generated by simulation over a virtual computer network. In the NSL-KDD dataset, redundant records from the KDD Cup '99 dataset are removed from the training set and duplicate records are removed from the test set. The Kyoto 2006+ dataset is built on three years of real network traffic data, with records labeled as normal (no attack), attack (known attack) or unknown attack. The Kyoto 2006+ dataset contains 14 statistical features derived from the KDD Cup '99 dataset and 10 additional features.

Highlights

  • Intrusion can be understood as an attempt to violate information protection, data integrity and resource accessibility (Protić, 2016, pp.483-495)

  • Since the KDD Cup '99 dataset is a simulation of network traffic, it contains a huge number of redundant records in the training set and duplicate records in the test set, which hinder the classification of the remaining, non-redundant records

  • This paper presents a review and a comparative analysis of KDD Cup ’99, NSL-KDD and Kyoto 2006+ datasets


Summary

Introduction

Intrusion can be understood as an attempt to violate information protection, data integrity and resource accessibility (Protić, 2016, pp.483-495). Since the KDD Cup '99 dataset is a simulation of network traffic, it contains a huge number of redundant records in the training set and duplicate records in the test set, which hinder the classification of the remaining, non-redundant records. In the NSL-KDD dataset, the number of records in the training and test sets is reasonable. Neither the KDD Cup '99 dataset nor the NSL-KDD dataset reflects real data flow in a computer network, since both were generated by simulation over a virtual network. The KDD Cup '99 dataset is a collection of network traffic data from a virtual environment, prepared for the Third Knowledge Discovery and Data Mining Tools Competition (KDD CUP '99 dataset, 1999). It is a subset of the 1998 DARPA dataset, which was collected by simulating the operation of a typical US Air Force LAN subjected to multiple attacks and which comprises nine weeks of TCP dump data. The whole KDD Cup '99 dataset contains 4,898,431 single connection records, each of which consists of 41 features and is labeled as either normal or an attack (see Table 1).

Table 1 (feature names, excerpt): duration, protocol type, service, flag, source bytes, destination bytes
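
The effect of redundant and duplicate records described above can be illustrated with a minimal sketch. It is not the procedure used to build NSL-KDD; it only shows exact-duplicate removal on a handful of hypothetical toy records that use the six features named in Table 1 plus a label. The helper name `drop_duplicate_records` and all record values are assumptions made for illustration.

```python
from collections import Counter


def drop_duplicate_records(records):
    """Return the records with exact duplicates removed.

    The first occurrence of each record is kept and the original
    order is preserved.
    """
    seen = set()
    unique = []
    for record in records:
        key = tuple(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical toy records (not taken from the real dataset):
# (duration, protocol_type, service, flag, src_bytes, dst_bytes, label)
records = [
    (0, "tcp", "http", "SF", 181, 5450, "normal"),
    (0, "icmp", "ecr_i", "SF", 1032, 0, "smurf"),
    (0, "icmp", "ecr_i", "SF", 1032, 0, "smurf"),
    (0, "icmp", "ecr_i", "SF", 1032, 0, "smurf"),
    (0, "tcp", "private", "S0", 0, 0, "neptune"),
]

unique_records = drop_duplicate_records(records)

# Count how often each label occurs before and after deduplication.
label_counts_before = Counter(r[-1] for r in records)
label_counts_after = Counter(r[-1] for r in unique_records)

print("records before:", len(records), dict(label_counts_before))
print("records after: ", len(unique_records), dict(label_counts_after))
```

In this toy example, the repeated record dominates the label counts before deduplication and carries the same weight as the other records afterwards, which is the kind of bias toward frequent records that removing duplicates is meant to reduce.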
