The rapid growth of the Internet has led to an increase in spam activity, posing significant security threats. Spammers use various tactics, such as intentional misspellings, to evade detection systems. Current spam detection relies mostly on semantic analysis, which is inadequate against these sophisticated evasion techniques. This paper addresses the problem by constructing a Chinese Camouflage Spam (CCS) dataset and proposing a novel detection and recognition model: Prompt and Spelling Checking-based BERT (PSC-BERT). BERT, a state-of-the-art language model, has succeeded in many NLP tasks but struggles with misspelled text. PSC-BERT extends BERT by integrating semantic, phonetic, and glyph information from spam texts and employs a novel hard-template prompt learning method that unifies the text classification and spell-checking tasks. In experiments, PSC-BERT outperformed baseline models, showing a 0.81% improvement over Semorph and a 2.01% improvement over LR in binary classification. In multi-class classification, it achieved a 0.95% increase in Macro-F1 and a 0.49% increase in Weighted-F1 over the best baseline. Two sub-datasets were derived from the Chinese Camouflage Spam dataset for detection and recognition, and extensive analysis confirmed the model's efficacy, especially in few-shot learning scenarios. Additionally, a case study illustrates BERT's attention mechanism, providing deeper insight into the tokens in the input sequences. In summary, PSC-BERT represents a significant advance in spam detection against intentional misspellings.