A Comparative Study of Machine Learning Algorithms and Their Ensembles for Botnet Detection

Songhui Ryu, Baijian Yang

doi:10.4236/jcc.2018.65010

Abstract

A botnet is a network of compromised devices controlled by a malicious “botmaster” to perform various tasks, such as executing DoS attacks, sending spam, and obtaining personal data. Because botmasters generate network traffic while communicating with their bots, analyzing network traffic to detect botnet traffic is a promising feature for an Intrusion Detection System. Although such systems have applied various machine learning techniques, a comparison of machine learning algorithms, including their ensembles, for botnet detection has not yet been carried out. This study evaluates three of the most popular classification algorithms (Naive Bayes, decision tree, and neural network) and also tests ensemble methods, which are known to strengthen classifiers, to see whether they indeed improve predictions in botnet detection. The evaluation is conducted on the public CTU-13 dataset and measures each classifier's training time, F-measure, and MCC score.

Highlights

As a network of compromised devices called bots, a botnet executes malicious tasks under the control of the attacker, a botmaster
A botnet is a network of compromised devices controlled by a malicious “botmaster” to perform various tasks, such as executing Denial of Service (DoS) attacks, sending spam, and obtaining personal data
Random forest is known as a way of avoiding overfitting that can happen in a single decision tree [16]

Summary

Introduction

As a network of compromised devices called bots, a botnet executes malicious tasks under the control of the attacker, a botmaster. According to [3], the primary goals of botnets are as follows: Information dispersion: sending spam and executing Denial of Service (DoS) attacks. MCC, also known as the phi coefficient, is considered less biased because it also incorporates true negatives. According to [22], MCC is more robust than the F1 score or accuracy on imbalanced data, where classification methods tend to be biased toward the majority class. To evaluate the classification algorithms and ensemble methods on the CTU-13 dataset, Scikit-learn was run on a single core of an Intel Xeon-E5 with 64 GB of memory. Because some of the features were categorical, which Scikit-learn cannot handle properly, data preparation including encoding and standardization was conducted.
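
The difference between the F1 score and MCC described above can be illustrated with a minimal sketch. It computes both metrics directly from confusion-matrix counts using their standard definitions. The counts are hypothetical and are not results from the study, and returning 0 for MCC when the denominator is zero is a common convention that is assumed here.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN). True negatives never appear."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient (phi coefficient), using all four cells."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when any marginal sum is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical imbalanced case: 90 positive samples, 10 negative samples,
# and a classifier that labels every sample as positive (the majority class).
majority_only = {"tp": 90, "tn": 0, "fp": 10, "fn": 0}

f1_majority = f1_score(majority_only["tp"], majority_only["fp"], majority_only["fn"])
mcc_majority = mcc(**majority_only)

print(f"F1={f1_majority:.3f} MCC={mcc_majority:.3f}")  # F1=0.947 MCC=0.000
```

In this hypothetical case the F1 score looks strong even though the classifier never identifies a negative sample, while MCC drops to 0 because it accounts for the missing true negatives.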

Methods

Results

Conclusion