Low latency, resource efficiency, and data privacy are crucial requirements in modern communication networks. Federated learning can address these requirements by keeping data at the network edge and training on it in parallel across edge devices, thereby preserving data privacy and reducing latency. For a large-scale federated learning task spanning many devices, challenges arise from device heterogeneity, data variability, and limited network resources. Careful selection of the edge devices participating in federated learning is therefore essential for resilient, reliable, and resource-efficient edge networks. In this context, this paper proposes an optimal device selection method that minimizes redundant data training and improves network resource utilization without degrading federated learning performance over resource-constrained edge networks. The proposed method aims to minimize network resource demands while maximizing the data diversity captured by the aggregated model. The performance of the proposed federated learning framework is evaluated on EMNIST, a publicly available image dataset of handwritten digits that extends MNIST. Experimental results indicate that the proposed framework matches the accuracy convergence of conventional federated learning while reducing device usage by up to 50% and resource utilization by up to 30% to reach 99% of the achievable accuracy. The proposed method is therefore well suited to resource-constrained edge networks.
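
The abstract does not spell out the selection algorithm itself; as a rough illustration of the kind of objective it describes (maximizing data diversity per unit of network resource consumed), the sketch below greedily adds the device whose local label distribution most increases the entropy of the aggregated label mix per unit of resource cost, until a budget is exhausted. All names here (`select_devices`, `label_dists`, `costs`, `budget`) are hypothetical placeholders, not taken from the paper.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy of a label-count vector (0 for an empty vector)."""
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts / total
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def select_devices(label_dists, costs, budget):
    """Greedy device selection (illustrative, not the paper's method):
    each step picks the device that most increases the entropy (diversity)
    of the aggregated label distribution per unit of resource cost,
    stopping when no affordable device improves diversity.

    label_dists : (n_devices, n_classes) per-device label counts
    costs       : (n_devices,) resource cost of training on each device
    budget      : total resource budget for one round
    """
    n_devices, n_classes = label_dists.shape
    selected = []
    remaining = set(range(n_devices))
    agg = np.zeros(n_classes)   # aggregated label counts of selected devices
    spent = 0.0
    while remaining:
        base = entropy(agg)
        best, best_score = None, 0.0
        for i in remaining:
            if spent + costs[i] > budget:
                continue  # device would exceed the resource budget
            gain = entropy(agg + label_dists[i]) - base
            score = gain / costs[i]  # diversity gained per unit cost
            if score > best_score:
                best, best_score = i, score
        if best is None:  # nothing affordable adds diversity
            break
        selected.append(best)
        remaining.discard(best)
        agg += label_dists[best]
        spent += costs[best]
    return selected

# Example: 20 devices, 10 classes (as in EMNIST digits), random costs.
rng = np.random.default_rng(0)
dists = rng.integers(0, 100, size=(20, 10)).astype(float)
costs = rng.uniform(1.0, 5.0, size=20)
print(select_devices(dists, costs, budget=15.0))
```

Under this toy objective, redundant devices (those whose data mirrors what is already aggregated) yield little entropy gain and are naturally skipped, which is one plausible way the abstract's goals of reduced device usage and preserved data diversity could be realized.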