Inline Detection of DGA Domains Using Side Information

Raaghavi Sivaguru, Martine De Cock, Jonathan Peck, Femi Olumofin, Anderson Nascimento

doi:10.1109/access.2020.3013494

Abstract

Malware applications typically use a command and control (C&C) server to manage bots that perform malicious activities. Domain Generation Algorithms (DGAs) are popular methods for generating pseudo-random domain names that can be used to establish communication between an infected bot and the C&C server. In recent years, machine-learning-based systems have been widely used to detect DGAs. There are several well-known state-of-the-art classifiers in the literature that can detect DGA domain names in real-time applications with high predictive performance. However, these DGA classifiers are highly vulnerable to adversarial attacks in which adversaries purposely craft domain names to evade detection. In our work, we focus on hardening DGA classifiers against adversarial attacks. To this end, we train and evaluate state-of-the-art deep learning and random forest (RF) classifiers for DGA detection using side information that is harder for adversaries to manipulate than the domain name itself. Additionally, the side information features are selected such that they are easily obtainable in practice to perform inline DGA detection. The performance and robustness of these models are assessed by exposing them to one day of real-traffic data as well as domains generated by adversarial attack algorithms. We found that the DGA classifiers that rely on both the domain name and side information have high performance and are more robust against adversaries.

Highlights

Domain Generation Algorithms (DGAs) are subroutines that generate pseudo-random combinations of characters or words, and output domain name strings [1].
Generated domains that have not been registered by the botmaster will typically result in an NXDomain response when queried, and can be discarded by the infected machine. This technique is often used by a command and control (C&C) center and an infected bot to establish communication and perform malicious activities as instructed by the C&C server.
Since we focus only on inline DGA detection in our work, we do not use the timestamp feature to perform DGA classification.

Summary

INTRODUCTION

Domain Generation Algorithms (DGAs) are subroutines that generate pseudo-random combinations of characters or words, and output domain name strings [1]. DGA detection can rely on the domain name string alone, or it can also draw on side information about the domain. The advantage of the former approach is that it does not require gathering additional information, which may be expensive to collect in real time, and that it allows defenders to detect DGA domain names and block them even before they can be resolved. The advantage of the latter approach is that side information is much harder for the attacker to manipulate than the domain name string itself, making machine learning models trained on side information potentially more robust against adversarial attacks. All findings are based on the first author’s master’s thesis, the full version of which is available at the University of Washington [25].
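To make the definition of a DGA concrete, the following is a minimal toy sketch of how a shared seed can yield the same pseudo-random domain list on two machines. It is not taken from the paper or from any real malware family: the function name `toy_dga`, the seed string, the SHA-256 construction, and the reserved `.example` suffix are all assumptions chosen for illustration.

```python
import hashlib
from datetime import date
from typing import List


def toy_dga(seed: str, day: date, count: int = 5, length: int = 12) -> List[str]:
    """Return `count` pseudo-random domain names derived from a seed and a date.

    Hypothetical illustration only: real DGAs differ in how they mix seeds,
    dates, character sets, and word lists.
    """
    domains = []
    for i in range(count):
        # Hash the seed, the date, and a counter so each name is reproducible.
        material = f"{seed}-{day.isoformat()}-{i}".encode("utf-8")
        digest = hashlib.sha256(material).digest()
        # Map the first `length` digest bytes to lowercase letters.
        label = "".join(chr(ord("a") + b % 26) for b in digest[:length])
        # ".example" is a reserved suffix, so these names never resolve.
        domains.append(label + ".example")
    return domains


# Two parties holding the same seed and date derive the same candidate list.
bot_side = toy_dga("hypothetical-seed", date(2020, 1, 1))
server_side = toy_dga("hypothetical-seed", date(2020, 1, 1))
```

Because both sides compute identical lists, only one of the candidates needs to be registered for the two to meet; the labels look random to an observer, which is the property that classifiers operating on the domain name string try to exploit.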

RELATED WORK

LEXICAL FEATURES

DGA CLASSIFIERS

EXPERIMENTAL RESULTS

CONCLUSION