Abstract
The World Wide Web has evolved rapidly, incorporating new content types and becoming more dynamic. The contents of a website can be distributed across several servers, and as a consequence, web traffic has become increasingly complex. From a network traffic perspective, it can be difficult to ascertain which websites a user is visiting, let alone which part of the user's traffic each website is responsible for. In this paper we present a method for identifying the TCP connections involved in the same full webpage download without the need for deep packet inspection. This identification is needed, for example, to enable free browsing of specific websites in pay-per-use mobile Internet access. This could apply not only to websites promoted by third parties but also to portals for governmental or medical emergency websites. The proposal is based on a modification of the DBSCAN clustering algorithm to work online and over one-dimensional sorted data. To validate our results we use both real traffic and packet captures from a controlled environment. The proposal achieves excellent results in consistency (99%) and completeness (92%), indicating a low error rate when identifying webpage downloads.
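To make the idea of online clustering over one-dimensional sorted data concrete, the following is a minimal Python sketch, not the modification proposed in the paper. It assumes the clustered feature is the start time of each TCP connection, that timestamps arrive in non-decreasing order, and that the DBSCAN minimum-points parameter is 2. Under that last assumption, DBSCAN on sorted one-dimensional data reduces to splitting the stream wherever two consecutive points are more than `eps` apart. The names `OnlineGapClusterer` and `cluster_stream`, and the `eps` value used below, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class OnlineGapClusterer:
    """Streaming clustering of sorted 1-D points (assumed: connection start times).

    Assumption: min_pts = 2, so a group is a maximal run of points whose
    consecutive gaps are all <= eps. A group of length 1 corresponds to what
    DBSCAN would label as noise; here it is returned so the caller can decide.
    """
    eps: float  # hypothetical maximum gap (seconds) inside one download
    _current: List[float] = field(default_factory=list, init=False)

    def add(self, t: float) -> Optional[List[float]]:
        """Feed one timestamp; return the finished group if t closes one."""
        if self._current and t < self._current[-1]:
            raise ValueError("timestamps must arrive in sorted order")
        finished = None
        # A gap larger than eps means t cannot be density-reachable from
        # the open group, so that group is complete.
        if self._current and t - self._current[-1] > self.eps:
            finished = self._current
            self._current = []
        self._current.append(t)
        return finished

    def flush(self) -> List[float]:
        """Return the group still open at the end of the stream."""
        finished, self._current = self._current, []
        return finished


def cluster_stream(timestamps: Iterable[float], eps: float) -> List[List[float]]:
    """Run the clusterer over a whole stream and collect every group."""
    clusterer = OnlineGapClusterer(eps)
    groups = []
    for t in timestamps:
        done = clusterer.add(t)
        if done is not None:
            groups.append(done)
    last = clusterer.flush()
    if last:
        groups.append(last)
    return groups
```

For example, `cluster_stream([0.0, 0.2, 0.5, 3.0, 3.1, 10.0], eps=1.0)` yields three groups: `[0.0, 0.2, 0.5]`, `[3.0, 3.1]` and `[10.0]`. Each point is processed once and only the currently open group is kept in memory, which is what makes a scheme of this kind suitable for real-time use.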
Highlights
The web is probably the Internet application that has grown and evolved the most during the past two decades
In this paper we address this problem by presenting a method capable of identifying individual full webpage downloads by clustering related connections together in real time
After performing a thorough characterization of these captures and testing different approaches to our problem, we present a method based on the DBSCAN clustering algorithm (Ester et al., 1996), which was designed for density-based clustering in noisy databases
Summary
The web is probably the Internet application that has grown and evolved the most during the past two decades. Services like e-mail, video streaming, online games or e-learning are, in many cases, provided through the web, taking advantage of the fact that web browsers are present in almost any network-enabled device and that web traffic usually faces few network restrictions. This ever-increasing popularity of the web has introduced new network requirements, which have driven improvements in web application protocols and the development of new techniques, such as content distribution networks (CDNs) (Fortino and Mastroianni, 2009). The web has achieved a remarkable flexibility that allows it to provide a huge range of different services, but at the cost of many added layers of complexity.