GramBeddings: A New Neural Network for URL Based Identification of Phishing Web Pages Through N-gram Embeddings

Ahmet Selman Bozkir, Firat Coskun Dalgic, Murat Aydos

doi:10.1016/j.cose.2022.102964

Abstract

The ever-growing use of the Internet and of communication channels such as social media escalates the need for rapid phishing detection mechanisms with low resource demands. In this study, we propose GramBeddings, a new deep neural model for phishing URL identification that introduces several distinguishing novelties: (1) the use of n-gram embeddings that are computed on the fly and require no pre-training stage, (2) removal of the need for word and sub-word level information, and (3) a smart and efficient n-gram selection pipeline, while also benefiting from an attention mechanism. In addition, we share a large-scale, novel, publicly available dataset (https://web.cs.hacettepe.edu.tr/~selman/grambeddings-dataset/) including 800K real-world phishing and legitimate URLs. Our scheme offers an adjustable and automated n-gram selection and filtering mechanism along with a new neural network architecture that concatenates a four-channel information flow through cascading CNN, LSTM, and attention layers. In this way, discriminative multi-level character patterns can be discovered without any hand-crafted operation and can contribute to the prediction. As a result, the proposed system provides the following features in the problem domain: (i) real-time, end-to-end, high-performance inference, (ii) language-agnostic prediction, and (iii) no dependence on any third-party service or hand-crafted feature. Our experiments show that our approach outperforms the other models in the literature with an accuracy of 98.27%. Moreover, a comparative study conducted with several datasets verifies the superiority of our model in all tests. We also examine the robustness of our model against a real-world adversarial attack and discuss methods of overcoming such an attack.
Our codebase is shared with the community to be used for benchmarking purposes in the future; the code and our supplementary material will be made available at https://www.github.com/fcdalgic/GramBeddings.
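To make the idea of character n-gram selection concrete, the following is a minimal sketch and not the authors' pipeline. It assumes that n-grams are ranked by raw frequency and that the top `top_k` are kept, which the abstract does not specify; the function names `char_ngrams`, `build_vocab`, and `encode`, and the `<pad>`/`<unk>` tokens, are hypothetical. The resulting integer ids are the kind of input a trainable embedding layer could consume without any pre-training.

```python
from collections import Counter


def char_ngrams(url, n):
    """Return all overlapping character n-grams of a URL string."""
    return [url[i:i + n] for i in range(len(url) - n + 1)]


def build_vocab(urls, n, top_k):
    """Map the top_k most frequent n-grams to integer ids.

    Assumption: selection by raw frequency. Ids 0 and 1 are
    reserved for padding and unknown n-grams.
    """
    counts = Counter(g for u in urls for g in char_ngrams(u, n))
    vocab = {"<pad>": 0, "<unk>": 1}
    for gram, _ in counts.most_common(top_k):
        vocab[gram] = len(vocab)
    return vocab


def encode(url, vocab, n, max_len):
    """Turn a URL into a fixed-length list of n-gram ids."""
    ids = [vocab.get(g, vocab["<unk>"]) for g in char_ngrams(url, n)]
    ids = ids[:max_len]                       # truncate long URLs
    return ids + [vocab["<pad>"]] * (max_len - len(ids))  # pad short ones
```

Running one such encoder per n-gram size would yield several parallel id sequences for the same URL, which is one plausible reading of the multi-channel input described above.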

Full Text