Comparison of Logistic Regression and Random Forest using Correlation-based Feature Selection for Phishing Website Detection

Farida Farida, Ali Mustopa

doi:10.32520/stmsi.v12i1.1832

Abstract

Information technology is developing rapidly, especially during the current pandemic, which requires many of us to study and even work online. This shift has triggered a great deal of crime on the internet. One such crime is the theft of internet users' data through a fake website built to resemble the original, known as a phishing website. This research builds a classification model to detect phishing websites, choosing the better-performing of two classification algorithms, logistic regression and random forest, to counter the rise of phishing websites in cyberspace. Classification performance can be improved by using the correlation-based feature selection (CFS) method to select the attributes most influential in detecting phishing websites. Based on the test results, logistic regression and random forest achieved accuracies of 93.035% and 96.834%, respectively. After feature selection with CFS, the accuracies were 92.718% and 97.015%, respectively. In testing, random forest accuracy increased by 0.181 percentage points, while logistic regression accuracy decreased insignificantly (by 0.317 percentage points). The test results show that feature selection with CFS can eliminate redundant attributes while the resulting classification accuracy differs little from that obtained with the complete attribute set, and that random forest achieves better accuracy after CFS is applied.

Keywords: phishing website, classification, logistic regression, random forest, correlation-based
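To illustrate how CFS can discard a redundant attribute, the following is a minimal sketch, not the authors' implementation. It assumes the commonly cited merit score k * r_cf / sqrt(k + k(k-1) * r_ff), uses absolute Pearson correlation as the correlation measure (the study's tooling may use a different measure, such as symmetrical uncertainty), and applies a simple greedy forward search. The feature names and toy data are hypothetical and are not taken from the study's dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(vx * vy)


def cfs_merit(columns, labels, subset):
    """Merit of a feature subset: k * r_cf / sqrt(k + k*(k-1) * r_ff).

    r_cf is the mean |correlation| between each feature and the class;
    r_ff is the mean |correlation| between pairs of features.
    Assumption: Pearson correlation stands in for the correlation measure.
    """
    k = len(subset)
    r_cf = sum(abs(pearson(columns[f], labels)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def cfs_forward_select(columns, labels):
    """Greedy forward search: add the feature that raises merit most; stop when none does."""
    selected, best = [], 0.0
    remaining = set(columns)
    while remaining:
        merit, feat = max(
            (cfs_merit(columns, labels, selected + [f]), f) for f in sorted(remaining)
        )
        if merit <= best:
            break
        selected.append(feat)
        remaining.remove(feat)
        best = merit
    return selected, best


# Hypothetical toy data: 1 = phishing, 0 = legitimate.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "url_has_ip": [1, 1, 1, 1, 0, 0, 0, 0],      # strongly tied to the class
    "url_has_at": [1, 1, 1, 0, 0, 0, 0, 0],      # mostly duplicates url_has_ip
    "random_noise": [1, 0, 1, 0, 1, 0, 1, 0],    # unrelated to the class
}

selected, merit = cfs_forward_select(columns, labels)
print(selected, round(merit, 3))
```

In this toy case the search keeps only the first attribute, because adding the near-duplicate or the unrelated attribute lowers the merit. The reduced attribute set would then be passed to the classifiers (logistic regression and random forest in the study) for comparison against the full set.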
