A Fuzzy Set Theoretic approach to discover user sessions from web navigational data

Waseem Ahmed,Mohammad Fazle Azeemz Mohammad Fazle Azeemz,Zahid Ansari,A Vinaya Babuy

doi:10.1109/raics.2011.6069435

Abstract

Due to the continuous increase in growth and complexity of WWW, web site publishers are facing increasing difficulty in attracting and retaining users. In order to design attractive web sites, designers must understand their users' needs. Therefore analysing navigational behaviour of users is an important part of web page design. Web Usage Mining (WUM) is the application of data mining techniques to web usage data in order to discover the patterns that can be used to analyse the user's navigational behaviour. Preprocessing, knowledge extraction and results analysis are the three main steps of WUM. Due to large amount of irrelevant information present in the web logs, the original log file can not be directly used in the WUM process. During the preprocessing stage of WUM raw web log data is to transformed into a set of user profiles. Each user profile captures a set of URLs representing a user session. This sessionized data can be used as the input for a variety of data mining tasks such as clustering, association rule mining, sequence mining etc. If the data mining task at hand is clustering, the session files are filtered to remove very small sessions in order to eliminate the noise from the data. But direct removal of these small sized sessions may result in loss of a significant amount of information specially when the number of small sessions is large. We propose a “Fuzzy Set Theoretic” approach to deal with this problem. Instead of directly removing all the small sessions below a specified threshold, we assign weights to all the sessions using a “Fuzzy Membership Function” based on the number of URLs accessed by the sessions. After assigning the weights we apply a “Fuzzy c-Mean Clustering” algorithm to discover the clusters of user profiles. In this paper, we provide a detailed review of various techniques to preprocess the web log data including data fusion, data cleaning, user identification and session identification. We also describe our methodology to perform feature selection (or dimensionality reduction) and session weight assignment tasks. Finally we compare our soft computing based approach of session weight assignment with the traditional hard computing based approach of small session elimination.

Full Text