Abstract

The Domain Name System (DNS) is central to the functioning of the internet. Just as organizations use domain names to provide access to their computational services, malicious actors use domain names to point to services under their control. Distinguishing between non-malicious and malicious domain names is extremely important, because it makes it possible to grant or block access to external services, improving the security of the organization and its users. Many DNS firewall solutions exist today. Most of them rely on lists of known malicious domains that are constantly updated. However, this approach can only block known malicious communications, leaving out many others that may be malicious but are not yet known. Adopting machine learning to classify domains contributes to the detection of domains that are not yet on the block list. The dataset described in this manuscript is meant for supervised machine learning-based analysis of malicious and non-malicious domain names. The dataset was created from scratch, using publicly available DNS logs of both malicious and non-malicious domain names. Using the domain name as input, 34 features were obtained. Features such as the domain name entropy, the number of strange characters, and the domain name length were obtained directly from the domain name. Other features, such as the domain name creation date, Internet Protocol (IP) address, open ports, and geolocation, were obtained from data enrichment processes (e.g. Open Source Intelligence (OSINT)). The class was determined from the data source (malicious DNS log files and non-malicious DNS log files). The dataset consists of data from approximately 90,000 domain names and is balanced: 50% non-malicious and 50% malicious domain names.
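The lexical features named above (entropy, number of strange characters, length) can be derived from the domain name string alone. The following is a minimal sketch, not the authors' extraction code: the function names are hypothetical, entropy is assumed to be Shannon entropy over characters, and "strange characters" is assumed here to mean anything other than ASCII letters, digits, hyphens, and dots. The dataset's own definitions may differ.

```python
import math
import string
from collections import Counter

# Assumption: characters outside this set count as "strange".
# The dataset's actual definition may be different.
ORDINARY_CHARS = set(string.ascii_letters + string.digits + "-.")


def shannon_entropy(name: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    if not name:
        return 0.0
    total = len(name)
    counts = Counter(name)
    # sum of p * log2(1/p) over all distinct characters
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def count_strange_characters(name: str) -> int:
    """Number of characters that are not in ORDINARY_CHARS."""
    return sum(1 for ch in name if ch not in ORDINARY_CHARS)


def lexical_features(domain: str) -> dict:
    """Hypothetical helper: features computed directly from the domain name."""
    return {
        "length": len(domain),
        "entropy": shannon_entropy(domain),
        "strange_characters": count_strange_characters(domain),
    }


if __name__ == "__main__":
    for d in ("example.com", "x9f3k2q-zz81.info"):
        print(d, lexical_features(d))
```

Names with a flat, wide spread of characters score a higher entropy than short, repetitive ones, which is one reason entropy is a plausible signal for algorithmically generated domains. Enrichment features such as creation date or open ports need external lookups and are not covered by this sketch.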

Highlights

  • The Domain Name System (DNS) is central to the functioning of the internet

  • The dataset described in this manuscript is meant for supervised machine learning-based analysis of malicious and non-malicious domain names

  • The dataset was created from scratch, using publicly available DNS logs of both malicious and non-malicious domain names

Summary

DNS dataset for malicious domains detection

Cláudio Marques a,∗, Silvestre Malta b, João Paulo Magalhães c

a Escola Superior de Tecnologia e Gestão, Politécnico de Viana do Castelo, Viana do Castelo 4900-348, Portugal
b ADiT-Lab, Escola Superior de Tecnologia e Gestão, Politécnico de Viana do Castelo, Viana do Castelo 4900-348, Portugal
c CIICESI, Escola Superior de Tecnologia e Gestão, Politécnico do Porto, Felgueiras, Portugal

Data in Brief 38 (2021) 107342

Distinguishing between non-malicious and malicious domain names is extremely important, because it makes it possible to grant or block access to external services, improving the security of the organization and its users. Many DNS firewall solutions exist today. Most of them rely on lists of known malicious domains that are constantly updated. The dataset described in this manuscript is meant for supervised machine learning-based analysis of malicious and non-malicious domain names. Features such as the domain name creation date, Internet Protocol (IP) address, open ports, and geolocation were obtained from data enrichment processes (e.g. Open Source Intelligence (OSINT)). The class was determined from the data source (malicious DNS log files and non-malicious DNS log files). The dataset consists of data from approximately 90,000 domain names and is balanced: 50% non-malicious and 50% malicious domain names.

Data source location
Data accessibility
Value of the Data
Values description
Findings
CRediT Author Statement