Analysis of Single and Ensemble Machine Learning Classifiers for Phishing Attacks Detection

Oyelakin A.M, Ajiboye I K, Alimi O M, Mustapha I O

doi:10.15282/ijsecs.7.2.2021.5.0088

Open Access

Abstract

Phishing attacks have been used in different ways to harvest the confidential information of unsuspecting internet users. To stem the tide of phishing-based attacks, several machine learning techniques have been proposed in the past. However, few studies have compared single and ensemble machine learning-based models for the classification of phishing attacks. This study carried out a performance analysis of selected single and ensemble machine learning (ML) classifiers in phishing classification. The focus is to investigate how these algorithms behave when classifying phishing attacks in the chosen dataset. Logistic Regression and Decision Trees were chosen as the single classifiers, while simple voting techniques and Random Forest were used as the ensemble algorithms. Accuracy, precision, recall and F1-score were used as performance metrics. The Logistic Regression algorithm recorded an accuracy of 0.86, a precision of 0.89, a recall of 0.87 and an F1-score of 0.81. Similarly, the Decision Trees classifier achieved an accuracy of 0.87, a precision of 0.83, a recall of 0.88 and an F1-score of 0.81. The voting ensemble achieved an accuracy of 0.92, a precision of 0.90, a recall of 0.92 and an F1-score of 0.92. The Random Forest algorithm recorded 0.98, 0.97, 0.98 and 0.97 for accuracy, precision, recall and F1-score respectively. In the experimental analyses, the Random Forest algorithm outperformed the simple averaging classifier and the two single algorithms used for phishing URL detection. The study established that the ensemble techniques used in the experiments are more effective for phishing URL identification than the single classifiers.
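
To illustrate how a voting ensemble combines several classifiers, the following is a minimal sketch of hard (majority) voting. The abstract does not state which voting scheme or which base models were used, so the three models, their predictions and the function name `majority_vote` are hypothetical; labels are assumed to be 1 for phishing and 0 for legitimate.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists by hard (majority) voting.

    Illustrative sketch only; the voting scheme used in the study is not
    specified in the abstract.
    """
    combined = []
    # zip(*...) groups the predictions of all models for the same sample.
    for votes in zip(*predictions_per_model):
        # Pick the label that the largest number of models predicted.
        label, _count = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical predictions from three classifiers on five URLs
# (1 = phishing, 0 = legitimate). These are not data from the study.
model_a = [1, 0, 1, 1, 0]
model_b = [1, 1, 0, 1, 0]
model_c = [0, 0, 1, 1, 1]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

With an odd number of models and two classes, every sample has a strict majority, so no tie-breaking rule is needed in this sketch.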

Highlights

Different methods have been used by attackers to launch phishing-based attacks in networks and the internet space
The confusion matrix contains the individual values of True Positive (TP), True Negative (TN), False Negative (FN), and False Positive (FP)
The mathematical formulae used for obtaining the values of the performance metrics are shown in equations 1, 2, 3 and 4 respectively: Accuracy = (TP + TN) / (TP + TN + FP + FN)
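
As a rough illustration of how the four metrics follow from the confusion-matrix counts, the sketch below computes them in Python. Only the accuracy formula is given in the text above; the precision, recall and F1-score expressions are the standard textbook definitions and are assumed here to match the paper's remaining equations. The function name and the example counts are hypothetical, not values from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1-score from TP, TN, FP, FN.

    Uses the standard definitions; zero denominators yield 0.0 instead of
    raising an error.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1-score is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical confusion-matrix counts for 200 URLs (not from the study).
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Keeping the counts as inputs makes it straightforward to recompute all four metrics for each classifier from its own confusion matrix.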

Summary

Introduction

Different methods have been used by attackers to launch phishing-based attacks in networks and the internet space. Cyber criminals use these phishing techniques to steal online users' personal identities as well as financial account credentials [1]. The authors in [1] reported that phishing attacks in the first quarter of 2020 rose well beyond the levels of previous years. Machine learning approaches have been found more suitable for phishing detection than signature-based techniques [2]. The author of [2] further argued that there are three major techniques for phishing detection: context-based techniques, URL-based methods and machine learning techniques.

Objectives

Results

Conclusion