Review of Parallel Computation Strategies for Statistical Service Engines Using R

陶瑞

doi:10.6843/nthu.2013.00413

Abstract

In an enterprise environment, source data are stored in various forms, such as files, databases, and data streams. Currently, analysts conduct data analysis offline using statistical software [5]. In a conventional sequential computer, processing is channeled through one physical location. In a parallel machine, processing can occur simultaneously at many locations, so many more computational operations per second should be achievable. Because the cost of processing, memory, and communication is falling rapidly, it has appeared inevitable for at least two decades that parallel machines will eventually displace sequential ones in computationally demanding fields [9]. Many modern enterprises collect data at the most detailed level possible, creating data repositories ranging from terabytes to petabytes in size. The ability to apply sophisticated statistical analysis methods to these data is becoming essential for marketplace competitiveness. The need to perform deep analysis over huge data repositories poses a significant challenge to existing statistical software and data management systems. On the one hand, statistical software provides rich functionality for data analysis and modeling but can handle only limited amounts of data; for example, popular packages such as R and SPSS operate entirely in main memory. On the other hand, data-intensive management systems, such as MapReduce-based systems, can scale to petabytes of data but provide insufficient analytical functionality [1]. We review the statistical model in Lee’s paper [5], which runs in sequential mode and processes the data given to it by the Application Server. We use the Hadoop/MapReduce model as the statistical engine at the back end of our data analysis architecture (the statistical service engine solution), which consists of two parts: the R statistical analysis system and an implementation of the Hadoop data management system.
This model consists of three components: an R driver process operated by the data analyst, a Hadoop cluster that hosts the data and runs Jaql (and possibly also some R sub-processes), and an R-Jaql bridge that connects these two components [1]. The aim is to improve the performance, scalability, and functionality of the statistical jobs sent to the engine in a cluster or distributed environment. We also use the Message Passing Interface (MPI) and parallel DBMS computation to support our model of parallel computation. In this way, we build a new system architecture for the statistical service engine solution of Lee’s paper.
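
As a rough illustration of how a statistical job can be split across a cluster in the MapReduce style described above, the following sketch computes a mean and sample variance from per-partition summaries. It is not the R/Jaql/Hadoop implementation reviewed here: the function names (`map_partition`, `combine`, `distributed_mean_variance`) and the sample data are hypothetical, and plain Python lists stand in for data blocks hosted on cluster nodes.

```python
from functools import reduce


def map_partition(values):
    """Map step: summarize one data partition as (count, mean, M2),
    where M2 is the sum of squared deviations from the partition mean."""
    n = len(values)
    if n == 0:
        return (0, 0.0, 0.0)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values)
    return (n, mean, m2)


def combine(a, b):
    """Reduce step: merge two partial summaries with the pairwise
    update formula for count, mean, and M2."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)


def distributed_mean_variance(partitions):
    """Run the map step on every partition, then reduce the summaries.
    On a real cluster each map call would run on the node holding the data."""
    partials = map(map_partition, partitions)
    n, mean, m2 = reduce(combine, partials, (0, 0.0, 0.0))
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance


# Hypothetical data, split into three partitions of unequal size.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
n, mean, variance = distributed_mean_variance(partitions)
```

Because each partition is reduced to a three-number summary before anything is merged, only the summaries need to travel between nodes, which is the property that lets this kind of job scale beyond the main memory of a single R process.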

Full Text