Abstract

Data summarization is an essential mechanism to accelerate analytic algorithms on large data sets. On the other hand, array DBMSs enable scalable computation with large matrices. With that motivation in mind, we propose a parallel array operator, based on a specific form of matrix multiplication, that computes a comprehensive data summarization matrix. By deriving equivalent equations based on the summarization matrix, statistical methods are adapted to work in two phases: (1) Parallel summarization of the data set in one pass; (2) Iteration exploiting the summarization matrix in many intermediate computations. We prove our summarization matrix captures essential statistical properties of the data set and it allows iterative algorithms to work faster in main memory, by decreasing the number of times the data set is scanned, and by reducing the number of CPU operations. Specifically, we show our summarization matrix benefits statistical models, including PCA, linear regression, and variable selection. From a systems perspective, we carefully study the efficient computation of the summarization matrix on the SciDB parallel array DBMS and how to exploit it in the R language statistical system. To achieve best performance, we introduce two specialized array operators for dense and sparse data sets, respectively. We present an experimental evaluation comparing SciDB, R, a columnar DBMS (a fast SQL engine), and Spark (a popular Hadoop system). Our experiments show R working together with SciDB eliminates main memory and performance limitations from R. More importantly, our R+SciDB prototype is significantly faster and more scalable than Spark and the columnar DBMS.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.