Deconstructing Domain Names to Reveal Latent Topics

Cheryl J Flynn, Kenneth E Shirley, Wei Wang

doi:10.1109/dsaa.2016.63

Abstract

Measurement of the lexical properties of domain names enables many types of relatively fast, lightweight web mining analyses. These include unsupervised learning tasks, such as automatic categorization and clustering of websites, as well as supervised learning tasks, such as classifying websites as malicious or benign. In this paper we explore whether these tasks can be better accomplished by identifying semantically coherent groups of words in a large set of domain names using a combination of word segmentation and topic modeling methods. We segment domain names to generate a large set of new domain-level features, and then compare three different unsupervised learning methods for identifying topics among domain name keywords: spherical k-means clustering (SKM), Latent Dirichlet Allocation (LDA), and the Biterm Topic Model (BTM). We successfully infer semantically coherent groups of words in two independent data sets, finding that BTM topics are quantitatively the most coherent. Using the BTM, we compare inferred topics across data sets and across time periods, and we also highlight instances of homophony within the topics. Finally, we show that the BTM topics can be used as features to improve the interpretability of a supervised learning model for the detection of malicious domain names. To our knowledge this is the first large-scale empirical analysis of the co-occurrence patterns of words within domain names.
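
To make the segmentation and co-occurrence idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a dictionary-based segmenter that picks the fewest vocabulary words covering a domain label, and it then lists the unordered word pairs (biterms) that a model such as the BTM takes as input. The vocabulary `VOCAB` and the helper names `segment`, `biterms`, and `keywords` are hypothetical and chosen only for illustration; the abstract does not state which segmentation method the paper uses.

```python
from itertools import combinations

# Hypothetical toy vocabulary (an assumption for this sketch); a real
# system would rely on a large word list or a language model.
VOCAB = {"cheap", "car", "cars", "insurance", "quote", "quotes",
         "online", "in", "sure"}


def segment(name, vocab=VOCAB):
    """Split `name` into the fewest vocabulary words, or return None."""
    # best[i] holds the shortest segmentation found for name[:i], if any.
    best = [None] * (len(name) + 1)
    best[0] = []
    for end in range(1, len(name) + 1):
        for start in range(end):
            word = name[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(name)]


def biterms(words):
    """Return every unordered word pair from one name as a sorted tuple."""
    return [tuple(sorted(pair)) for pair in combinations(words, 2)]


def keywords(domain):
    """Drop the last dot-separated part and segment what remains.

    Simplification: names with subdomains or multi-part suffixes are
    not handled and will generally return None.
    """
    label = domain.lower().rsplit(".", 1)[0]
    return segment(label)


if __name__ == "__main__":
    words = keywords("cheapcarinsurance.com")
    print(words)
    print(biterms(words))
```

Pooling the biterms from all domain names in a data set yields the word co-occurrence counts from which a biterm-based topic model is estimated. The estimation step itself is omitted here.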
