Abstract
Data preprocessing is an important phase of Web usage mining because log data is unstructured, heterogeneous and noisy. Complete and effective data preprocessing ensures the efficiency and scalability of the algorithms used in the pattern discovery phase of Web usage mining. Data preprocessing generally includes data fusion, data cleaning, user identification, session identification and path completion, among other steps. Data cleaning is the initial and important step of preprocessing, producing cleaned data for further processing. When the analysis covers a specific time duration, e.g. one day, one week or one month, it is important to apply data extraction to the raw log data before data cleaning. In this paper we focus mainly on the data fusion, data extraction and data cleaning steps of preprocessing, and propose a data extraction algorithm that extracts log data according to the time duration under analysis. This algorithm also sorts log entries by date and time, an ordering that will later be used to predict the browsing sequence of a user. We then apply a data cleaning algorithm to the extracted real Web server log. Data cleaning considers almost all irrelevant files, irrelevant HTTP methods and wrong HTTP status codes, and the experiment indicates that the raw log data reduces to almost 80%, which demonstrates the importance of the initial phases of data preprocessing.
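The extraction and cleaning steps described above can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper, whose details are not given in the abstract. It assumes log entries in Common Log Format, and the names `parse_line`, `extract` and `clean`, the list of irrelevant file suffixes, the choice of GET as the only relevant method, and the treatment of 2xx as the only valid status codes are all hypothetical choices made for illustration.

```python
import re
from datetime import datetime, timezone

# Assumption: entries follow the Common Log Format; the paper's actual
# log format is not stated in the abstract.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Assumptions: illustrative filter settings, not the paper's lists.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
RELEVANT_METHODS = {"GET"}


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    return entry


def extract(lines, start, end):
    """Keep entries with start <= time < end, sorted by date and time."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    selected = [e for e in entries if start <= e["time"] < end]
    return sorted(selected, key=lambda e: e["time"])


def clean(entries):
    """Drop irrelevant files, irrelevant methods and non-2xx status codes."""
    kept = []
    for e in entries:
        path = e["url"].split("?", 1)[0].lower()  # ignore any query string
        if path.endswith(IRRELEVANT_SUFFIXES):
            continue
        if e["method"] not in RELEVANT_METHODS:
            continue
        if not 200 <= e["status"] < 300:
            continue
        kept.append(e)
    return kept


if __name__ == "__main__":
    # Made-up sample lines, used only to exercise the functions.
    sample = [
        '10.0.0.1 - - [02/Jan/2024:10:05:00 +0000] "GET /index.html HTTP/1.1" 200 512',
        '10.0.0.1 - - [02/Jan/2024:10:04:00 +0000] "GET /logo.gif HTTP/1.1" 200 128',
        '10.0.0.3 - - [05/Jan/2024:08:00:00 +0000] "GET /about.html HTTP/1.1" 200 256',
    ]
    day_start = datetime(2024, 1, 2, tzinfo=timezone.utc)
    day_end = datetime(2024, 1, 3, tzinfo=timezone.utc)
    extracted = extract(sample, day_start, day_end)
    print(len(extracted), len(clean(extracted)))
```

Running extraction first restricts the log to the analysis window and orders it chronologically, so the cleaning pass only touches entries that matter for the chosen duration.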