A novel cross-domain adaptation framework for unsupervised criminal jargon detection via pre-trained contextual embedding of darknet corpus

Liang Ke, Peng Xiao, Xinyu Chen, Shui Yu, Xingshu Chen, Haizhou Wang

doi:10.1016/j.eswa.2023.122715

Abstract

As regulation of the surface web becomes more stringent, criminals are gradually turning to darknet markets for illicit operations. Moderating and studying the content of these marketplaces helps combat criminal activity on the darknet. However, to evade law enforcement surveillance, criminals widely use jargon to disguise their conversations. Jargon assigns cryptic meanings to seemingly innocuous words, creating a major challenge for criminal investigation. Current research on Chinese jargon detection focuses on keyword matching, an approach that cannot keep up with the rapid emergence of new jargon across domains. To the best of our knowledge, we are the first to study Chinese jargon detection in darknet markets. Specifically, we design an unsupervised cross-domain adaptation framework for Chinese jargon detection (CJD-Framework) that integrates a pre-trained language model. First, six Chinese-language underground markets are crawled to build the first darknet corpus dataset (DC-dataset). Next, a word-level Chinese pre-trained model is proposed to extract contextual embeddings for darknet words. Finally, a cross-corpus framework based on semantic similarity analysis is constructed to identify Chinese jargon on the darknet. Comprehensive experiments demonstrate the effectiveness and generalizability of the CJD-Framework over state-of-the-art models, with a detection accuracy of 91.5%. The darknet corpus dataset and the framework proposed in this research can provide data and ideas for future analysis of underground crime in darknet markets.
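The cross-corpus similarity idea summarized above can be illustrated with a minimal sketch. It is not the CJD-Framework itself: it assumes each word already has one embedding from the darknet corpus and one from a general reference corpus, that both lie in a shared vector space, and that a word whose two embeddings disagree is a jargon candidate. The function names, the threshold value, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_jargon_candidates(darknet_vecs, reference_vecs, threshold=0.5):
    """Return (word, similarity) pairs whose cross-corpus similarity is below threshold.

    Assumption: both dictionaries map a word to an embedding in the same space.
    Words missing from the reference corpus are skipped, not flagged.
    """
    candidates = []
    for word, dark_vec in darknet_vecs.items():
        ref_vec = reference_vecs.get(word)
        if ref_vec is None:
            continue
        sim = cosine_similarity(dark_vec, ref_vec)
        if sim < threshold:
            candidates.append((word, sim))
    # Lowest similarity first: the strongest candidates lead the list.
    return sorted(candidates, key=lambda pair: pair[1])


# Toy embeddings, invented purely for illustration (hypothetical values).
darknet_vecs = {"word_a": [0.9, 0.1, 0.0], "word_b": [0.0, 0.2, 0.9]}
reference_vecs = {"word_a": [0.8, 0.2, 0.1], "word_b": [0.9, 0.1, 0.1]}

candidates = flag_jargon_candidates(darknet_vecs, reference_vecs)
for word, sim in candidates:
    print(f"{word}: similarity {sim:.3f}")
```

In this toy input only `word_b` falls below the threshold, because its darknet vector points in a different direction from its reference vector. A real pipeline would obtain the embeddings from the pre-trained model and would need to choose the threshold empirically.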

Full Text