Phishing attacks remain a significant threat, with cybercriminals continually developing more sophisticated techniques to deceive users into revealing sensitive information. This study presents a framework for detecting phishing URLs and websites using the XG-Boost (Extreme Gradient Boosting) algorithm, which is known for its strong performance and efficiency in classification tasks. The proposed system analyzes lexical, structural, and host-based features extracted from URLs and webpage content to distinguish legitimate sites from malicious ones. By leveraging XG-Boost, the framework aims to improve detection accuracy and reduce the false positive rate, thereby improving user safety online. The model is trained on a diverse set of features, with hyperparameter tuning to optimize its predictive performance, and is evaluated on several benchmark datasets representative of real-world phishing scenarios. Results indicate that the XG-Boost-based approach achieves high classification accuracy and remains robust on the imbalanced datasets common in phishing detection. The findings underscore the effectiveness of machine learning techniques in cybersecurity and highlight the potential of XG-Boost for real-time phishing detection systems. The features used include the presence of special characters, domain age, HTTPS usage, and the similarity of a website's content to known phishing sites. They are collected from the URL and the website's characteristics and combined into a feature vector for each website.
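As an illustration of how a URL might be turned into part of such a feature vector, the following is a minimal Python sketch, not the authors' implementation. The function name `lexical_features`, the feature names, and the `SPECIAL_CHARS` set are hypothetical choices made for this example. It covers only features computable from the URL string itself; domain age and content similarity would require external lookups and are omitted.

```python
from urllib.parse import urlparse

# Hypothetical set of characters treated as "special"; the study does not list its own.
SPECIAL_CHARS = "@-_=?&%~"


def lexical_features(url: str) -> dict:
    """Return a small dictionary of URL-derived features (illustrative only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        # Overall length of the URL string
        "url_length": len(url),
        # Number of dots in the host name (many subdomains can be suspicious)
        "num_dots_in_host": host.count("."),
        # Total occurrences of the characters in SPECIAL_CHARS
        "num_special_chars": sum(url.count(c) for c in SPECIAL_CHARS),
        # 1 if the scheme is HTTPS, else 0
        "uses_https": int(parsed.scheme == "https"),
        # 1 if an "@" appears anywhere in the URL, else 0
        "has_at_symbol": int("@" in url),
        # 1 if the host consists only of digits and dots (a rough IPv4 check), else 0
        "host_is_ip": int(host != "" and host.replace(".", "").isdigit()),
    }


example = lexical_features("http://192.168.0.1/login?user=a@b.com")
print(example)
```

The values of the returned dictionary, taken in a fixed key order, would form the numeric feature vector passed to a classifier.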
The XG-Boost model is trained on a labeled dataset of phishing and legitimate websites and learns patterns indicative of phishing activity. Its gradient boosting framework combines many weak decision trees into a strong predictive model, which makes it well suited to this task. To evaluate the proposed system, extensive experiments are conducted on publicly available datasets containing thousands of phishing and legitimate URLs, and performance is measured in terms of accuracy, precision, recall, and F1 score. The results show that the XG-Boost-based model outperforms traditional machine learning algorithms, such as decision trees and support vector machines (SVM), detecting phishing websites with high accuracy and few false positives. This highlights XG-Boost's ability to handle the imbalanced nature of phishing datasets, in which phishing samples are often underrepresented relative to legitimate ones.
Key Words: XG-Boost (Extreme Gradient Boosting), Classifier, Features, Phishing, Train, Accuracy.
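The four evaluation metrics named above can be computed from a confusion matrix. The sketch below uses only the Python standard library and toy labels invented for this example (1 = phishing, 0 = legitimate); it does not reproduce any result from the study. The helper name `classification_metrics` is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = phishing)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy, imbalanced example: 2 phishing samples among 10, one of them missed.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case accuracy is 0.9 while recall is only 0.5, which shows why precision, recall, and F1 are reported alongside accuracy when phishing samples are underrepresented.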