Abstract

As the digital world evolves, the risk of valuable information being exposed to unauthorized parties is increasing. One common attack vector is the malicious Uniform Resource Locator (URL): a fraudulent link spread across platforms such as social media and email. Traditional methods of identifying these URLs, such as blacklisting and heuristic search, rely heavily on syntax or keyword matching and struggle to keep up with the evolving tactics of cyber attackers. Hence, this paper proposes a solution for detecting malicious URLs and their types based on lexical features. Lexical features of a URL are the components that convey semantic and lexical meaning. These can include domain names, path lengths, special characters, and other elements that can be analyzed for patterns or anomalies. In our proposed method, we use 23 different lexical features that focus on the semantic and lexical meaning of the URLs. Exploratory Data Analysis (EDA) is used to select the most important lexical features, those that contribute effectively to predictions. With these carefully curated features, we treat the problem as a multi-class classification task and assess the performance of three distinct classifiers: Random Forest, a pure bagging technique that currently stands as the best-performing solution in this domain, and XGBoost and LightGBM, both of which use boosting techniques. With the proposed method, we achieve over 93% accuracy for all three classifiers, with Random Forest achieving the best performance.
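
To make the notion of lexical features concrete, the following is a minimal sketch of how a few such features could be extracted from a URL string using only the Python standard library. The function name `lexical_features`, the feature names, and the `SPECIAL_CHARS` set are illustrative assumptions; they are not the 23 features used in this work, which the abstract does not enumerate.

```python
import re
from urllib.parse import urlparse

# Hypothetical set of characters counted as "special"; chosen for illustration only.
SPECIAL_CHARS = "@?-=_%&"

# Matches a leading scheme such as "http://" or "ftp://".
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def lexical_features(url: str) -> dict:
    """Compute a handful of illustrative lexical features for one URL."""
    # urlparse only fills in the host when a scheme or a leading "//" is present,
    # so scheme-less inputs are prefixed with "//" before parsing.
    parsed = urlparse(url if _SCHEME_RE.match(url) else "//" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "num_dots": url.count("."),
        "num_digits": sum(c.isdigit() for c in url),
        "num_special": sum(url.count(c) for c in SPECIAL_CHARS),
        # Labels to the left of "domain.tld"; a rough count that ignores
        # multi-part public suffixes such as "co.uk".
        "num_subdomain_labels": max(host.count(".") - 1, 0),
        # Crude dotted-quad check: four dot-separated groups made only of digits.
        "has_ip_host": host.replace(".", "").isdigit() and host.count(".") == 3,
        "uses_https": parsed.scheme == "https",
    }


if __name__ == "__main__":
    for name, value in lexical_features("http://192.168.0.1/login.php?id=1").items():
        print(f"{name}: {value}")
```

Each URL becomes a fixed-length vector of numeric or boolean values, which is the form that tree-based classifiers such as Random Forest, XGBoost, and LightGBM take as input.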
