Abstract

In recent years, cloud computing has become an alternative to on-premise solutions for enterprises hosting their IT stack. The main idea behind cloud computing is to offer remote, bundled IT resources, especially data storage and computing power, without the need for active management by the user. Due to economies of scale, cloud computing often comes not only with much lower costs than on-premise solutions but also gives users the ability to scale their resources up and down based on their needs. Database management systems (DBMSs), which are used to store and query data about customers, orders, etc., are a major building block of enterprise applications today. However, at the time this research work started in 2013, bringing DBMSs to the cloud for online analytical processing (OLAP) and online transactional processing (OLTP) workloads was an open issue that needed to be tackled. The main reason was that the classical DBMS architecture, designed in the 1980s, could not optimally support the new challenges, such as elasticity and fault tolerance, that arise when moving DBMSs to the cloud. In the first part of this thesis, we present XDB, a new parallel database architecture to support scalable data analytics in the cloud. XDB not only implements a new partitioning scheme that supports elastic scalability, but also a fine-grained fault-tolerance cost model that minimizes the total runtime of a query in case of failures. In addition to classical database workloads, deep learning workloads are becoming increasingly important in the cloud. Deep learning (DL) on deep neural networks (DNNs) has proven capable of solving complex problems such as image understanding and pattern recognition. However, training DL models can take a considerable amount of time. To speed up the training process, several modern machine learning frameworks support distributed DL.
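The fine-grained fault-tolerance cost model mentioned for XDB can be illustrated with a minimal sketch. The functions, cost terms, and the simplifying "at most one failure" assumption below are illustrative choices, not XDB's actual model: the idea is only that, per operator, one compares the expected runtime of checkpointing an intermediate result against re-running the operator on failure.

```python
# Hypothetical sketch of a fault-tolerance cost model: for each query
# operator, choose between materializing its output (pay a write cost
# up front, but restart from the checkpoint on failure) and running
# pipelined (no write cost, but recompute on failure). All names and
# formulas are illustrative assumptions, not XDB's actual model.

def expected_runtime(run_cost, mat_cost, p_fail, materialize):
    """Expected runtime of one operator under a per-run failure
    probability p_fail, assuming at most one failure for simplicity."""
    if materialize:
        # On failure, re-run only this operator (checkpoint survives).
        return run_cost + mat_cost + p_fail * run_cost
    # Without materialization, a failure also forces recomputing the
    # operator's inputs; folded in here as one extra full run.
    return run_cost + p_fail * 2 * run_cost

def choose_strategy(run_cost, mat_cost, p_fail):
    """Pick the strategy with the lower expected total runtime."""
    with_mat = expected_runtime(run_cost, mat_cost, p_fail, True)
    without = expected_runtime(run_cost, mat_cost, p_fail, False)
    return "materialize" if with_mat < without else "pipeline"

# Cheap materialization plus a high failure rate favors checkpointing;
# expensive materialization plus a low failure rate favors pipelining.
print(choose_strategy(run_cost=100.0, mat_cost=5.0, p_fail=0.3))    # materialize
print(choose_strategy(run_cost=100.0, mat_cost=50.0, p_fail=0.01))  # pipeline
```

The key point the sketch captures is that the decision is made per operator from cost estimates, rather than globally checkpointing everything or nothing.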
However, users of these frameworks need to manually define the distribution strategy (i.e., the number of workers and parameter servers), which is a long and complex process. This intensive user involvement means that these machine learning frameworks are not yet ready for the cloud. In the second part of this thesis, we present XAI, a middleware for scalable machine learning that runs on top of existing machine learning frameworks. With XAI, we aim to provide scalable support for artificial intelligence (AI) in the cloud, similar to what the first part does for DBMSs. XAI implements a new approach to automate the distributed deployment of a deep learning training job, optimally choosing the number of parameter servers and workers to achieve scalable training in the cloud. As a result, users of machine learning frameworks no longer need to spend considerable time manually setting the training and distribution strategy of a DL job.
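Automatically choosing the number of workers and parameter servers can be sketched as a search over configurations scored by a cost model. The throughput formula, bandwidth figure, and function names below are deliberately simple assumptions for illustration, not XAI's actual model: more workers add compute, but each training step also pushes and pulls gradients through the parameter servers, so communication eventually dominates.

```python
# Hypothetical sketch of automating the distribution strategy for a
# data-parallel training job with parameter servers: enumerate the
# feasible (workers, parameter servers) splits of a machine budget and
# pick the split with the highest estimated throughput. The throughput
# model is an illustrative assumption, not XAI's actual cost model.

def estimated_throughput(workers, ps, step_time=1.0, model_mb=100.0,
                         ps_bandwidth_mb=500.0):
    """Examples/sec proxy: workers add compute in parallel, but every
    step each worker exchanges its model shards with the PS nodes."""
    if workers < 1 or ps < 1:
        return 0.0
    # Each PS serves model_mb / ps of traffic to every worker per step.
    comm_time = (model_mb / ps) * workers / ps_bandwidth_mb
    return workers / (step_time + comm_time)

def choose_deployment(budget):
    """Best split of `budget` machines into workers and PS nodes."""
    return max(((w, budget - w) for w in range(1, budget)),
               key=lambda cfg: estimated_throughput(*cfg))

workers, ps = choose_deployment(budget=16)
```

Under this toy model, the optimizer dedicates a nontrivial share of the budget to parameter servers instead of the naive "all machines as workers" choice, which is exactly the kind of decision the abstract describes automating.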
