Recent studies report that about half of Web users today are intelligent agents (Web bots). Many bots are highly sophisticated impersonators that try to emulate the navigational behavior of legitimate (human) users. Moreover, bot technology continues to evolve, which makes bot detection even harder. To deal with this problem, many advanced methods for differentiating bots from humans have been proposed, a large share of which rely on supervised machine learning techniques. In this paper, we propose a novel approach to identifying various profiles of bots and humans, which combines feature selection and unsupervised learning of HTTP-level traffic patterns to develop a user session classification model. Session clustering is performed with the agglomerative Information Bottleneck (aIB) algorithm, as well as with several reference algorithms. The model is then used to assign new sessions to one of the profiles and to label each session as performed by a bot or a human. An extensive experimental study, based on real server log data, demonstrates the ability of aIB clustering to distinguish user profiles and confirms the high performance of the classification model in terms of accuracy, F1, recall, and precision.
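To make the clustering step concrete, the following is a minimal sketch of the generic agglomerative Information Bottleneck merge procedure, not the implementation used in the paper. It assumes each session x is described by a prior weight p(x) and a conditional distribution p(y|x) over discretized feature values y; how the paper builds these distributions from HTTP-level features is not stated in the abstract. The names `kl`, `merge_cost`, and `aib` are illustrative. At each step, the pair of clusters whose merge loses the least information about y is joined, where the loss is the combined weight times the weighted Jensen-Shannon divergence of the two conditionals.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats; terms with p[k] == 0 contribute 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def merge_cost(w_i, p_i, w_j, p_j):
    """Return (cost, merged conditional) for joining two clusters.

    cost = (w_i + w_j) * weighted Jensen-Shannon divergence of p_i and p_j.
    Weights are assumed to be strictly positive.
    """
    w = w_i + w_j
    a, b = w_i / w, w_j / w
    mix = [a * x + b * y for x, y in zip(p_i, p_j)]
    return w * (a * kl(p_i, mix) + b * kl(p_j, mix)), mix

def aib(weights, conditionals, n_clusters):
    """Greedily merge clusters until n_clusters remain.

    weights[i]: p(x_i), prior mass of session i (assumed input).
    conditionals[i]: p(y | x_i), distribution over feature values (assumed input).
    Returns a list of (member_indices, weight, conditional) triples.
    """
    clusters = [([i], w, list(p))
                for i, (w, p) in enumerate(zip(weights, conditionals))]
    while len(clusters) > n_clusters:
        best = None
        # Exhaustive search over all pairs for the cheapest merge.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost, mix = merge_cost(clusters[i][1], clusters[i][2],
                                       clusters[j][1], clusters[j][2])
                if best is None or cost < best[0]:
                    best = (cost, i, j, mix)
        _, i, j, mix = best
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1],
                  mix)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

Under these assumptions, a new session could be assigned to the profile whose merged conditional is closest to its own, for example by the same divergence. The pairwise search here is quadratic per merge and is meant only to illustrate the criterion, not to scale to real server logs.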