Abstract

Web browsing privacy is of paramount importance to Internet users. Although users try to protect themselves from monitoring by taking advantage of encryption or VPNs, their privacy is still not guaranteed. This holds even given the tangled structure of the web, where a single page loads content from several domains at once and a cloud provider's IP addresses are shared by several sites. In this work, we provide a novel approach to identifying user web browsing that takes into account only the IP addresses the user has connected to, without performing any reverse DNS resolutions. We use this sequence of addresses as input to different state-of-the-art deep learning models, such as multi-layer perceptrons and transformers, which are able to accurately identify which website was actually visited among Alexa's World Top 500 most visited domains. Moreover, we have studied other factors: the dependence on the DNS server used to resolve the visited IP addresses, the accuracy for the top domains (e.g., Google, YouTube, Facebook), data augmentation by packet sampling simulation to improve our results, the impact of packet sampling, the fine-tuning and possible impact of model parameters, and the scalability of our approach. We conclude that, using only 10% of the packets, we can identify the visited website with an accuracy and F1 score between 94% and 95%.
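
As an illustration of the data augmentation by packet sampling simulation mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes uniform random sampling of packets and assumes the model input is the ordered list of distinct destination IP addresses that survive sampling; the abstract does not specify either detail. The function names `sample_ip_sequence` and `augment` are hypothetical, and the addresses come from reserved documentation ranges.

```python
import random


def sample_ip_sequence(packet_dst_ips, rate=0.1, seed=None):
    """Keep each packet with probability `rate` (assumed uniform sampling)
    and return the distinct destination IPs of the kept packets,
    in order of first appearance."""
    rng = random.Random(seed)
    seen = set()
    sequence = []
    for ip in packet_dst_ips:
        if rng.random() >= rate:
            continue  # packet dropped by the simulated sampler
        if ip not in seen:
            seen.add(ip)
            sequence.append(ip)
    return sequence


def augment(packet_dst_ips, rate=0.1, copies=5, seed=0):
    """Produce several sampled variants of one page visit (hypothetical helper).
    Each copy uses a different seed, so the variants generally differ."""
    return [
        sample_ip_sequence(packet_dst_ips, rate=rate, seed=seed + i)
        for i in range(copies)
    ]


if __name__ == "__main__":
    # Toy trace: one destination IP per packet of a single page visit.
    trace = ["192.0.2.1", "192.0.2.2", "192.0.2.1", "198.51.100.7"] * 50
    for variant in augment(trace, rate=0.1, copies=3):
        print(variant)
```

Under these assumptions, each sampled variant could serve as an additional training example labelled with the same website, which is one way sampling at a 10% rate might both shrink the input and enlarge the training set.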
