The internet, while offering extensive services and information, has also become a platform for malicious activities, particularly through harmful websites that threaten cybersecurity. Detecting and classifying these websites is crucial for protecting users from online threats. Traditional detection methods, based primarily on blacklists and signature matching, struggle to keep pace with the rapidly evolving strategies of cybercriminals. Recent advances in Machine Learning (ML) show promise, though they remain works in progress. This research addressed the challenge by exploring Natural Language Processing (NLP) and ML techniques for classifying websites as benign or malicious. Unlike many existing studies that relied on URL features alone, this study incorporated a more comprehensive feature set, including URL, content, and additional web attributes, which enhanced classification accuracy. Because the dataset was imbalanced and skewed towards malicious sites, SMOTE (Synthetic Minority Over-sampling Technique) was applied to address the class imbalance, which improved model performance. Hashing Vectorizer (HashingV) and TF-IDF (Term Frequency-Inverse Document Frequency) were adopted to transform textual features into vector representations, while PCA (Principal Component Analysis) and truncated Singular Value Decomposition (truncSVD) were then used to optimize the feature representation at different dimensionalities. Five ML classifiers, namely Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), K-Nearest Neighbors (KNN), and Logistic Regression (LR), were tested, and performance was evaluated using accuracy, precision, recall, and F1-score. The Random Forest classifier with HashingV recorded the best results, with accuracies of 99.9563% using truncSVD and 99.9561% using PCA.
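As an illustration of the oversampling step named above, the following is a minimal sketch of the core SMOTE idea: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its k nearest minority-class neighbours. It is not the study's implementation; the function name `smote`, its parameters, the value of k, and the toy data are all assumptions made for illustration, and the abstract does not state which library or settings were used.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic vectors interpolated from minority-class vectors.

    Hypothetical helper for illustration only; not taken from the study.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base sample.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data (assumed): 3 minority rows against 6 majority rows.
minority_rows = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
majority_count = 6
new_rows = smote(minority_rows, n_new=majority_count - len(minority_rows), k=2)
balanced_minority = minority_rows + new_rows
print(len(balanced_minority))
```

In practice a maintained implementation such as the `SMOTE` class in the imbalanced-learn package would normally be preferred over a hand-written version, and the oversampling would be applied to the training split only so that synthetic samples do not leak into evaluation.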