Abstract

With the freedom offered by the Deep Web, people have the opportunity to express themselves freely and discretely, and sadly, this is one of the reasons why people carry out illicit activities there. In this work, a novel dataset for Dark Web active domains known as crawler-DB is presented. To build the crawler-DB, the Onion Routing Network (Tor) was sampled, and then a web crawler capable of crawling into links was built. The link addresses that are gathered by the crawler are then classified automatically into five classes. The algorithm built in this study demonstrated good performance as it achieved an accuracy of 85%. A popular text representation method was used with the proposed crawler-DB crossed by two different supervised classifiers to facilitate the categorization of the Tor concealed services. The results of the experiments conducted in this study show that using the Term Frequency-Inverse Document Frequency (TF-IDF) word representation with a linear support vector classifier achieves 91% of 5 folds cross-validation accuracy when classifying a subset of illegal activities from crawler-DB, while the accuracy of Naïve Bayes was 80.6%. The good performance of the linear SVC might support potential tools to help the authorities in the detection of these activities. Moreover, outcomes are expected to be significant in both practical and theoretical aspects, and they may pave the way for further research.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call