Towards Accurate DGA Detection based on Siamese Network with Insufficient Training Samples

Xiaoyan Hu, Guang Cheng, Jian Gong, Miao Li, Hua Wu, Ruidong Li

doi:10.1109/icc45855.2022.9838409

Abstract

Domain Generation Algorithms (DGAs) are widely used in a variety of malicious attacks, such as botnets. Attackers use DGAs to dynamically create pseudorandom domains that evade security detection and connect bots with Command and Control (C&C) servers. The detection of Algorithmically Generated Domains (AGDs) therefore plays an essential role in network attack detection. Most existing DGA detectors are machine learning-based or deep learning-based methods. However, these detectors perform relatively poorly when training samples are insufficient, as is the case for small-scale DGA families and emerging DGA variants. In addition, machine learning-based detectors require sophisticated and time-consuming manual feature extraction, and attackers can circumvent the extracted features. This paper focuses on accurately detecting DGAs with insufficient training samples using a siamese network. Our proposed DGA detection method is referred to as DGAD-SN. DGAD-SN first introduces contrastive learning and adopts the siamese network framework to construct a feature extractor, which captures the implicit relationships between characters in domain name strings from limited training samples. Machine learning-based DGA classifiers are then trained on the extracted neural feature vectors of domain names to identify AGDs. Our experimental studies suggest that DGAD-SN can efficiently extract distinguishable neural feature vectors for domain names and outperforms state-of-the-art DGA detectors in identifying small-scale DGA families or emerging DGA variants. With limited training samples, its average accuracy is 10% to 15% higher than that of conventional machine learning-based detection methods and about 1% to 2% higher than that of deep learning-based detection methods.
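To make the contrastive-learning step more concrete, the following is a minimal sketch of how labeled domain names could be turned into training pairs and scored with a contrastive loss. It is an illustration under stated assumptions, not the DGAD-SN implementation: the abstract does not specify the loss, the character encoding, or the encoder architecture, so the classic margin-based contrastive loss is assumed here, and the names `ALPHABET`, `encode`, `make_pairs`, and `contrastive_loss` are hypothetical. The shared encoder that maps both domains of a pair to feature vectors is omitted.

```python
import math
from itertools import combinations

# Assumed character set for domain names; index 0 is reserved for padding/unknown.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-."


def encode(domain, max_len=32):
    """Map a domain name to a fixed-length list of integer character ids."""
    ids = [ALPHABET.find(c) + 1 for c in domain.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))


def make_pairs(samples):
    """Build all (domain_a, domain_b, same) pairs from (domain, label) samples.

    same is 1 when both domains share a label and 0 otherwise. Pairing lets
    n labeled samples yield n * (n - 1) / 2 training examples, which is one
    reason siamese training suits small data sets.
    """
    return [
        (a, b, int(label_a == label_b))
        for (a, label_a), (b, label_b) in combinations(samples, 2)
    ]


def contrastive_loss(u, v, same, margin=1.0):
    """Margin-based contrastive loss on two feature vectors (assumed form).

    Similar pairs are penalized by their squared distance; dissimilar pairs
    are penalized only when they lie closer than the margin.
    """
    d = math.dist(u, v)
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Hypothetical toy samples, for illustration only.
samples = [
    ("google.com", "benign"),
    ("wikipedia.org", "benign"),
    ("xjkqzvbw.com", "dga"),
]
pairs = make_pairs(samples)
```

In a full pipeline, the same encoder (with shared weights) would embed both domains of each pair, the loss above would be minimized over all pairs, and the resulting feature vectors would then be passed to a conventional classifier, as the abstract describes.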

Full Text