Large Scale Distributed Data Science from scratch using Apache Spark 2.0

James Shanahan,Liang Dai

doi:10.1145/3041021.3051108

Abstract

Apache Spark is an open-source cluster computing framework. It has emerged as the next generation big data processing engine, overtaking Hadoop MapReduce which helped ignite the big data revolution. Spark maintains MapReduce's linear scalability and fault tolerance, but extends it in a few important ways: it is much faster (100 times faster for certain applications), much easier to program in due to its rich APIs in Python, Java, Scala, SQL and R (MapReduce has 2 core calls) , and its core data abstraction, the distributed data frame. In addition, it goes far beyond batch applications to support a variety of compute-intensive tasks, including interactive queries, streaming, machine learning, and graph processing. With massive amounts of computational power, deep learning has been shown to produce state-of-the-art results on various tasks in different fields like computer vision, automatic speech recognition, natural language processing and online advertising targeting. Thanks to the open-source frameworks, e.g. Torch, Theano, Caffe, MxNet, Keras and TensorFlow, we can build deep learning model in a much easier way. Among all these framework, TensorFlow is probably the most popular open source deep learning library. TensorFlow 1.0 was released recently, which provide a more stable, flexible and powerful computation tool for numerical computation using data flow graphs. Keras is a high-level neural networks library, written in Python and capable of running on top of either TensorFlow or Theano. It was developed with a focus on enabling fast experimentation. This tutorial will provide an accessible introduction to large-scale distributed machine learning and data mining, and to Spark and its potential to revolutionize academic and commercial data science practices. It is divided into three parts: the first part will cover fundamental Spark concepts, including Spark Core, functional programming ala map-reduce, data frames, the Spark Shell, Spark Streaming, Spark SQL, MLlib, and more; the second part will focus on hands-on algorithmic design and development with Spark (developing algorithms from scratch such as decision tree learning, association rule mining (aPriori), graph processing algorithms such as pagerank/shortest path, gradient descent algorithms such as support vectors machines and matrix factorization. Industrial applications and deployments of Spark will also be presented.; the third part will introduce deep learning concepts, how to implement a deep learning model through TensorFlow, Keras and run the model on Spark. Example code will be made available in python (pySpark) notebooks.

Full Text