Optimal Algorithms for Preprocessing of Server Side Web Logs Using Parallel Computing and HBASE

M Vithya, Sangaiah Suguna

doi:10.2139/ssrn.3165312

Abstract

Nowadays, internet use has increased tremendously. Every day, internet users generate 2.5 quintillion bytes of data from various sources, which has led to Big Data analytics. Web usage mining is the type of web mining activity that involves discovering user access patterns from web log data. It has three phases: Data Preprocessing, Data Discovery, and Data Evaluation. In this paper we focus mainly on data preprocessing, an important phase of web usage mining that is required because log data is unstructured, heterogeneous, and noisy. In general, two types of logs, server-side logs and client-side logs, are used for web usability analysis. Preprocessing consists of five phases: Data Extraction, Data Cleaning, User Identification, Session Identification, and Path Completion. This paper presents a specific data preprocessing case using Hadoop for the Vizhamurasu News site. In this work, server-side logs are used to evaluate the proposed preprocessing algorithms. The existing preprocessing algorithms are efficient but not scalable: as the log file grows, they take much more computation time than the proposed parallel computing techniques.
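As an illustration of the preprocessing phases listed above, the following is a minimal single-machine sketch in Python. It is not the authors' algorithm: it assumes the server logs are in Apache Combined Log Format, identifies a user by the pair (IP address, user agent), and splits sessions with a 30-minute inactivity timeout, none of which is stated in the abstract. The names `LOG_RE`, `NOISE_SUFFIXES`, `SESSION_TIMEOUT`, `extract`, `clean`, `sessionize`, and `preprocess` are hypothetical. Path completion, the parallel Hadoop execution, and HBase storage are not covered.

```python
import re
from datetime import datetime, timedelta

# Assumption: Apache Combined Log Format; the abstract does not name the format.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)
# Assumption: requests for these resource types count as noise.
NOISE_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".ico")
# Assumption: 30 minutes of inactivity ends a session (a common heuristic).
SESSION_TIMEOUT = timedelta(minutes=30)


def extract(lines):
    """Data extraction: parse raw log lines into field dictionaries."""
    for line in lines:
        match = LOG_RE.match(line)
        if match:
            record = match.groupdict()
            record["time"] = datetime.strptime(
                record["time"], "%d/%b/%Y:%H:%M:%S %z"
            )
            yield record


def clean(records):
    """Data cleaning: keep successful GET page requests from non-robot agents."""
    for record in records:
        path = record["url"].split("?")[0].lower()
        if record["method"] != "GET":
            continue
        if not record["status"].startswith("2"):
            continue
        if path.endswith(NOISE_SUFFIXES):
            continue
        if "bot" in record["agent"].lower():
            continue
        yield record


def sessionize(records):
    """User and session identification.

    Returns a dict mapping (ip, agent) to a list of sessions, where each
    session is a list of records ordered by time.
    """
    sessions = {}
    for record in sorted(records, key=lambda r: r["time"]):
        user = (record["ip"], record["agent"])
        user_sessions = sessions.setdefault(user, [])
        if user_sessions:
            gap = record["time"] - user_sessions[-1][-1]["time"]
            if gap <= SESSION_TIMEOUT:
                user_sessions[-1].append(record)
                continue
        user_sessions.append([record])
    return sessions


def preprocess(lines):
    """Run extraction, cleaning, and user/session identification in order."""
    return sessionize(clean(extract(lines)))
```

Each phase is a separate function operating on a stream of records, so the phases could in principle be mapped onto parallel workers keyed by user; how the paper actually distributes the work is not described in the abstract.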

Full Text