Processing Using Spark—A Potent of BD Technology

M Venkatesh Saravanakumar, Sabibullah Mohamed Hanifa

doi:10.1007/978-981-13-0550-4_9

Abstract

Processing, accessing, analyzing, securing, and storing big data are the core modalities of big data (BD) technology. Spark is a core processing layer: an open-source, in-memory cluster computing platform and unified data processing engine that is fast and reliable for cutting-edge analysis of all types of data. It can join different datasets across multiple disparate data sources. It supports in-memory computing and enables faster query access than disk-based engines such as Hadoop. This chapter covers the major processing capabilities behind Spark and its related components: Resilient Distributed Datasets (RDDs), the scalable machine learning library (MLlib), Spark's incremental streaming pipeline, the parallel graph computation interface GraphX, SQL DataFrames and Spark SQL (a data processing paradigm that supports columnar storage), and recommendation systems built with MLlib. All libraries operate on RDDs as the data abstraction, which makes them easy to compose in any application. RDDs are the major abstraction of this fault-tolerant computing engine: they provide explicit support for data sharing across users' computations, can capture a wide range of processing workloads, and can be manipulated in parallel across the cluster in a fault-tolerant manner. They are exposed through functional programming APIs in BD-supported languages such as Scala and Python. The chapter also offers a viewpoint on the core scalability of Spark for building high-level data processing libraries for future-generation applications. To understand and simplify BD tasks as a whole, with a focus on processing for hindsight, insight, and foresight, Spark's core engine and the components of its ecosystem are explained in a clear and interpretable way, which is essential for data science compilers at present.
A deep dive into the relevant big data content (current big data tools in Spark and cloud storage) is undertaken in this initiative to remove the bottlenecks in developing efficient and comprehensive analytics applications.

Full Text