Boosting the phishing detection performance by semantic analysis

Xi Zhang, Xiao-Bo Jin, Zhi-Wei Yan, Yu Zeng, Guang-Gang Geng

doi:10.1109/bigdata.2017.8258030

Abstract

Phishing has become increasingly severe in recent years, seriously threatening the privacy and property security of Internet users. Phishing is essentially brand counterfeiting: to deceive victims effectively, phishing sites closely resemble the legitimate sites they imitate, both visually and semantically. In recent years, machine learning-based methods have become the mainstream approach to anti-phishing. The effectiveness of these models hinges on the extracted statistical features. However, those features mainly capture visual similarity, information stealing, and third-party services, and ignore the semantic information of web pages. We therefore extract a series of semantic features through word2vec to better characterize phishing sites, and fuse them with other multi-scale statistical features to construct a more robust phishing detection model. Experimental results on real-world data sets show that the majority of phishing websites can be identified effectively using only the semantic features mined from word embeddings. The detection models based on the fused features obtained the best detection results, which shows that the semantic features and the other statistical features complement each other well. The proposed method offers a promising approach to phishing detection in real Internet environments and effectively boosts detection performance.
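
The fusion of semantic and statistical features described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy embedding table stands in for a trained word2vec model, averaging word vectors is one common way to obtain a page-level semantic feature (the abstract does not say how the paper derives its features), and the two statistical features and all function names are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical toy embedding table standing in for a trained word2vec model.
# Real embeddings would have far more dimensions and a much larger vocabulary.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "login": [0.9, 0.1, 0.0],
    "bank": [0.8, 0.2, 0.1],
    "password": [0.7, 0.3, 0.0],
    "weather": [0.0, 0.1, 0.9],
}


def page_embedding(
    tokens: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Average the vectors of all known tokens (an assumed pooling choice).

    Returns a zero vector when no token is in the vocabulary.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def fuse_features(
    semantic: Sequence[float],
    statistical: Sequence[float],
) -> List[float]:
    """Concatenate semantic and statistical features into one vector."""
    return list(semantic) + list(statistical)


# Tokens of a hypothetical page; only "login" and "bank" are in the toy table.
tokens = "please login to your bank".split()
semantic = page_embedding(tokens, TOY_EMBEDDINGS, dim=3)

# Hypothetical statistical features: [has_password_form, uses_https].
statistical = [1.0, 0.0]

# The fused vector is what a downstream classifier would consume.
fused = fuse_features(semantic, statistical)
print(len(fused), fused)
```

Concatenation is the simplest fusion strategy and keeps each feature group interpretable; a classifier trained on the fused vector can then weigh the two groups against each other.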

Full Text