Abstract

Knowledge of practical methods for processing large amounts of data can seem like black magic. Many tools and techniques exist for scalable data processing, notably caching (for example, with memcached), replication, partitioning, and, of course, MapReduce/Hadoop. Hadoop is an open-source framework implementing MapReduce, the map-and-reduce programming model that underlies Google's approach to querying the distributed datasets that make up the web. This article is intended for programmers, architects, and project managers involved in offline processing of large volumes of data. It describes how to obtain a copy of Hadoop, how to set up a cluster, and how to write analysis programs. We begin by applying Hadoop in its default configuration to a few simple tasks, such as analyzing the frequency of word occurrences in a corpus of documents; this illustrates the basic ideas behind Hadoop and MapReduce. We then turn to the core concepts of MapReduce applications built with Hadoop, examining along the way the components of the framework, the application of Hadoop to a wide range of data analysis tasks, and numerous examples of Hadoop in action.

Keywords: big data, cluster, Hadoop.
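
To make the word-frequency example concrete, the sketch below shows the canonical MapReduce word count as it is commonly written against Hadoop's org.apache.hadoop.mapreduce API. Class names and input/output paths are illustrative rather than taken from the article, and minor API details can vary between Hadoop versions.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every token in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce phase: sum the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory of documents
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, such a job would typically be submitted with a command along the lines of hadoop jar wordcount.jar WordCount /corpus/input /corpus/output, with the cluster handling input splitting, the shuffle, and fault tolerance.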
