Abstract

Supervised machine learning is often used to detect phishing websites. However, the scarcity of phishing data for training limits classifier performance. Further, machine learning algorithms are prone to adversarial attacks: small perturbations to attack data can bypass the classifier. These problems make machine learning less effective for phishing detection. We propose two Generative Adversarial Network (GAN) based approaches that synthesize phishing and legitimate samples to mimic real-world websites. An Adversarial Autoencoder (AAE) and a Wasserstein GAN (WGAN) are trained on ten publicly available phishing datasets and used to generate the synthetic data. Using both real and synthesized data, we demonstrate how to implement classifiers with higher performance and more resistance to adversarial attacks. We propose a set of hypotheses and validate them through experiments to demonstrate: (i) indistinguishability of synthesized samples from actual ones, (ii) susceptibility of classifiers to adversarial attacks, (iii) mitigation of adversarial attacks by training on larger datasets that include correctly labeled synthesized samples, and (iv) better performance of classifiers trained on large datasets. Our AAE and WGAN have been trained on a wide range of datasets, making us optimistic about their widespread applicability.
