Big Data Software

Julián Luengo, Sergio Ramírez-Gallego, Diego García-Gil, Francisco Herrera, Salvador García

doi:10.1007/978-3-030-39105-8_9

Abstract

The advent of Big Data has created the need for new computing tools capable of processing huge amounts of data. Apache Hadoop was the first open-source framework to implement the MapReduce paradigm. Apache Spark appeared a few years later, improving on the Hadoop ecosystem. Similarly, Apache Flink has appeared in recent years to tackle the Big Data streaming problem. However, although these frameworks were created to deal with huge amounts of data, many practitioners also need machine learning algorithms to extract knowledge from that data. The success of a Big Data framework is therefore strongly tied to its machine learning capability. This is why these frameworks now include a Big Data machine learning library: MLlib in the case of Spark, and FlinkML in the case of Flink. In this chapter, we analyze both the MLlib and FlinkML libraries in depth. We start with a description of Apache Spark MLlib and all of its components. We continue with a description of BigDaPSpark, a Big Data library focused on data preprocessing for Apache Spark. Next, we provide an extensive analysis of FlinkML and its included algorithms and utilities. Finally, we describe BigDaPFlink, a Big Data streaming library focused on data preprocessing for Apache Flink.

Full Text