Abstract

Big data refers to datasets that have grown beyond the processing ability of traditional database tools. Hadoop addresses big data by processing massive quantities of information on a cluster of commodity hardware. Web server logs are semi-structured files, usually flat text, that machines generate in large volumes. MapReduce handles them efficiently because it processes one line at a time. This paper performs session identification on log files using Hadoop in a distributed cluster. Apache Hadoop MapReduce, a data processing platform, is used in pseudo distributed mode and in fully distributed mode. The framework identifies the sessions of web surfers in order to recognize unique users and the pages they accessed. The identified sessions are analyzed in R to produce a statistical report based on the total count of visits per day. The results are compared with a non-Hadoop approach in a Java environment, and the proposed work shows better time efficiency, storage, and processing speed.

Highlights

  • Data is a collection of facts from the grids of web servers, usually in unorganized form, in the digital universe

  • The web server logs are mined for efficient session identification using Hadoop MapReduce

  • The NASA web server logs, gathered in four different files, are used for processing in the Hadoop environment

Summary

INTRODUCTION

Data is a collection of facts from the grids of web servers, usually in unorganized form, in the digital universe. The volume of data grows larger day by day as use of the World Wide Web becomes an interdisciplinary part of human activities. The rise of this data has led to new technology such as big data, which acts as a tool to process, manipulate, and manage very large datasets along with the storage they require. Big data is distinct from large existing databases and uses the Hadoop framework for data-intensive distributed applications. Sayalee Narkhede et al. [5] introduced the Hadoop-MR log file analysis tool, which provides a statistical report on the total hits of a web page, user activity, and traffic sources. Tweets are stored in HBase using a Hadoop cluster through REST calls, and text mining algorithms are applied for data analysis. The identified sessions are analyzed by date and number of visits using the R tool.
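The session identification idea can be illustrated with a minimal, single-process sketch in Python that mimics the map, shuffle, and reduce phases. It is not the paper's Hadoop implementation. It assumes the logs are in Common Log Format (the format of the public NASA web server logs), that a user is approximated by the requesting host, and that a gap of more than 30 minutes between requests starts a new session; the 30-minute threshold, the function names, and the sample hosts are all hypothetical choices made for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Common Log Format, e.g.:
# host - - [01/Jul/1995:00:00:01 -0400] "GET /path HTTP/1.0" 200 6245
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:\S+) (\S+)[^"]*" (\d{3}) (\S+)$'
)

# Assumed inactivity threshold; the source does not state one.
SESSION_TIMEOUT = timedelta(minutes=30)


def map_line(line):
    """Map phase: emit (host, (timestamp, page)) for one log line, or None."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None  # skip malformed lines
    host, stamp, page = match.group(1), match.group(2), match.group(3)
    when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
    return host, (when, page)


def reduce_host(hits, timeout=SESSION_TIMEOUT):
    """Reduce phase: split one host's hits into sessions by inactivity gap."""
    sessions = []
    for when, page in sorted(hits):
        # Continue the current session if the gap to its last hit is small.
        if sessions and when - sessions[-1][-1][0] <= timeout:
            sessions[-1].append((when, page))
        else:
            sessions.append([(when, page)])
    return sessions


def identify_sessions(lines):
    """Group mapped records by host (the shuffle), then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        pair = map_line(line)
        if pair is not None:
            grouped[pair[0]].append(pair[1])
    return {host: reduce_host(hits) for host, hits in grouped.items()}


def sessions_per_day(sessions_by_host):
    """Count sessions by the calendar date of their first request."""
    counts = defaultdict(int)
    for sessions in sessions_by_host.values():
        for session in sessions:
            counts[session[0][0].date().isoformat()] += 1
    return dict(counts)


# Illustrative lines in Common Log Format (hosts are made up).
SAMPLE = [
    'host-a.example.com - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
    'host-a.example.com - - [01/Jul/1995:00:10:00 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
    'host-a.example.com - - [01/Jul/1995:01:00:00 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
    'host-b.example.net - - [02/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
]

if __name__ == "__main__":
    print(sessions_per_day(identify_sessions(SAMPLE)))
```

In this sample the first host yields two sessions, because its third request arrives 50 minutes after the second, and the second host yields one. In an actual Hadoop job the grouping by host would be done by the framework's shuffle rather than by an in-memory dictionary, and the per-day counts would come from the downstream analysis step.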

HADOOP MAPREDUCE
LOG MINING USING HADOOP APPROACH
RESULTS AND INTERPRETATIONS
Pseudo distributed mode: the Hadoop framework consists of five daemons, namely
Fully distributed mode
CONCLUSION