Abstract

Data preprocessing is an essential step in preparing target datasets suitable for statistical and data mining algorithms. It has become one of the most complex parts of Web usage mining because Web server logs are massive and unstructured. Data preprocessing in Web usage mining is divided into several phases: data fusion, data cleaning, user identification, session identification, path completion, and data formatting. This paper focuses on the initial phases of the Web usage mining process, namely data cleaning, user identification, and session identification. As log data grows to terabyte and petabyte scale, traditional data preprocessing algorithms fail to scale and run into Big Data issues. Over the past few years, the MapReduce framework has evolved into one of the most widely used parallel programming frameworks for processing Big Data on a cluster of nodes. In this paper, a MapReduce-based data preprocessing algorithm is developed that comprises the data cleaning, user identification, and session identification subphases. Various efficient heuristics are incorporated into an existing MapReduce-based data preprocessing algorithm to detect ethical and unethical robots. Several experiments performed on a cluster of nodes show that the proposed MapReduce-based data preprocessing algorithm is efficient and scalable for larger datasets. Moreover, we analyze the impact of robots’ requests on the sessions generated in the session identification phase to measure the effectiveness of the proposed approach.
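To make the three subphases concrete, the following is a minimal single-process sketch of how data cleaning, user identification, and session identification could be arranged as a map step and a reduce step. It is not the algorithm proposed in the paper, whose heuristics are not given in this abstract. Everything specific here is an assumption chosen for illustration: the record layout, the names `map_clean`, `reduce_sessions`, and `run_job`, the 30-minute timeout, identifying a user by the (IP, user agent) pair, and the two robot checks (a `robots.txt` request or a robot-like user-agent string).

```python
from collections import defaultdict
from datetime import datetime, timedelta

# All constants below are illustrative assumptions, not values from the paper.
SESSION_TIMEOUT = timedelta(minutes=30)
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")
ROBOT_MARKERS = ("bot", "crawler", "spider")


def map_clean(record):
    """Map step: data cleaning plus user identification.

    `record` is a hypothetical parsed log entry:
    (ip, user_agent, timestamp, url, status).
    Emits ((ip, user_agent), (timestamp, url)) for requests worth keeping.
    """
    ip, agent, ts, url, status = record
    if status != 200:                           # drop failed requests
        return []
    if url.lower().endswith(STATIC_SUFFIXES):   # drop embedded resources
        return []
    return [((ip, agent), (ts, url))]


def is_robot(agent, urls):
    """Assumed heuristics: self-declared agent string or a robots.txt request."""
    declared = any(marker in agent.lower() for marker in ROBOT_MARKERS)
    return declared or "/robots.txt" in urls


def reduce_sessions(user_key, visits):
    """Reduce step: discard robots, then split requests on the timeout."""
    _, agent = user_key
    ordered = sorted(visits)
    if is_robot(agent, [url for _, url in ordered]):
        return []
    sessions, current, last_ts = [], [], None
    for ts, url in ordered:
        if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
            sessions.append(current)
            current = []
        current.append(url)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions


def run_job(records):
    """Simulate the shuffle between map and reduce with an in-memory dict."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_clean(record):
            grouped[key].append(value)
    result = {}
    for key, visits in grouped.items():
        sessions = reduce_sessions(key, visits)
        if sessions:
            result[key] = sessions
    return result
```

On a real cluster the grouping done by `run_job` would be performed by the framework's shuffle phase. Robot detection sits in the reduce step of this sketch because it needs to see all of a user's requests at once.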
