Abstract

Big Data is a term used for collections of data sets so large and complex that they are difficult to store and process using traditional data processing applications. The challenges include capturing, storing, searching, sharing, curating, transferring, analyzing and visualizing this data. However, the knowledge obtained from big data can be analyzed for insights that lead to better strategic decisions and improved organizational performance on a subject of interest. Big data is often characterized by three major characteristics: the volume of the data, the variety of data types it accepts (structured, unstructured or semi-structured) and the velocity at which the data must be processed. To store, process and analyze big data at high bandwidth, many software frameworks are available; however, the open-source Apache Hadoop framework, with its Hadoop Distributed File System (HDFS), is used successfully because of its advantages over others in distributed environments. To support big data, many tools such as MapReduce, HBase, Hive, Sqoop, Pig, ZooKeeper, NoSQL databases, Mahout, Oozie and so on are used. The most widely used tool for managing storage resources across the cluster is HDFS. Even though HDFS handles large data sets using clusters with parallel processing, it has some challenges. In this paper, the challenges of big data processing are discussed. The most significant issues are inefficient usage of HDFS caused by delays in scheduling new MapReduce tasks, and portability limitations. This paper investigates the root causes of these performance bottlenecks in order to evaluate the trade-offs between portability and performance in the Hadoop distributed file system. To minimize these limitations, methodologies proposed by various authors for big data processing are also discussed.
