User Identification in the Process of Web Usage Data Preprocessing

Jozef Kapusta, Michal Munk, Martin Drlík, Dominik Halvoník

doi:10.3991/ijet.v14i09.9854

Abstract

Any discussion of user behavior analytics has to start from the main sources of valuable information. One of these sources is the web server. The necessary data can be extracted from several places, most commonly the access log, error log, and custom log files of the web server, the proxy server log file, the web browser log, and browser cookies. In its default form, a web server log is known as a Common Log File (W3C, 1995) and records the IP address, the date and time of the visit, and the accessed and referenced resource. Standardized methodologies define several steps for extracting new knowledge from such data. In each of them, the first step is usually to identify users, user sessions, page views, and clickstreams. This process is called pre-processing. The main goal of this stage is to take an unprocessed web server log file as input and produce meaningful representations that can be used in the next phase. In this paper, we describe in detail user session identification, which can be considered the most important part of data pre-processing. Our paper aims to compare user/session identification using the Session Time Thresholds (STT) method with user/session identification using cookies. The comparison was made with respect to the quality of the generated sequential rules, i.e., with respect to the generation of useful, trivial, and inexplicable rules.
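As an illustration of the log fields named in the abstract, the following minimal Python sketch parses one access log line. It is not the authors' tooling. The regular expression, the function name `parse_line`, and the sample line are assumptions made for this example. The pattern accepts the Common Log Format and, optionally, the referrer and user-agent fields appended by the Combined Log Format, since the referenced resource is usually logged there.

```python
import re
from datetime import datetime

# Assumed pattern: Common Log Format, optionally followed by the
# referrer and user-agent fields of the Combined Log Format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Timestamps look like 10/Oct/2019:13:55:36 +0200 (English month names).
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


# Hypothetical sample line, not taken from the paper's data set.
sample = (
    '192.0.2.1 - - [10/Oct/2019:13:55:36 +0200] '
    '"GET /index.html HTTP/1.1" 200 2326 '
    '"http://example.org/start" "Mozilla/5.0"'
)
record = parse_line(sample)
```

Lines that do not match return `None`, so malformed entries can be counted or dropped in a later cleaning step. For plain Common Log Format lines, the `referrer` and `agent` fields are `None`.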

Highlights

Public and even private websites are often subject to various analyses
We aim to identify the precision of the Session Time Thresholds (STT) method compared with the identification based on cookies
Our aim is to determine the reliability of the user/session identification method using the STT compared with the identification based on cookies
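The general idea behind time-threshold session identification can be sketched in a few lines of Python. This is a minimal illustration, not the procedure evaluated in the paper. The function name `split_sessions`, the 30-minute default threshold, and the use of an opaque user key (for example IP address plus user agent) are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def split_sessions(hits, threshold=timedelta(minutes=30)):
    """Group (user_key, timestamp, url) hits into sessions.

    A new session starts for a user whenever the gap since that user's
    previous hit exceeds the threshold (assumed value: 30 minutes).
    """
    sessions = []
    last_seen = {}  # user_key -> (time of last hit, index of open session)
    for user, ts, url in sorted(hits, key=lambda h: h[1]):
        prev = last_seen.get(user)
        if prev is None or ts - prev[0] > threshold:
            sessions.append({"user": user, "pages": []})
            idx = len(sessions) - 1
        else:
            idx = prev[1]
        sessions[idx]["pages"].append(url)
        last_seen[user] = (ts, idx)
    return sessions


# Hypothetical hits: user A pauses for 50 minutes before the last request.
hits = [
    ("A", datetime(2019, 5, 1, 10, 0), "/a"),
    ("B", datetime(2019, 5, 1, 10, 5), "/c"),
    ("A", datetime(2019, 5, 1, 10, 10), "/b"),
    ("A", datetime(2019, 5, 1, 11, 0), "/d"),
]
sessions = split_sessions(hits)
```

With cookie-based identification, the user key would instead come from a session identifier stored in the cookie, so no time heuristic is needed. That difference is what makes the two approaches comparable on the same log data.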

Summary

Introduction

Public and even private websites are often subject to various analyses. These analyses serve multiple purposes, from maintenance planning to structural improvements. All of these analyses are based on data, which can be found in specific places. One of the most valuable sources is the web server. These machines contain a large number of different data sets. We focus on data related to visitors’ behaviour. While the primary web data is used to get knowledge from the web struc…

iJET ‒ Vol 14, No 9, 2019

Objectives

Methods

Results

Discussion

Conclusion