Abstract

Multidimensional data summarization is a fundamental mechanism to accelerate the computation of machine learning (ML) models. Relational DBMSs, for their part, can scale beyond main memory limits, evaluate SQL queries in parallel, and hide complex internal system details. Motivated by these strengths, we present a wide spectrum of alternative SQL queries to compute a summarization matrix that significantly accelerates the computation of many ML models in a data science language (e.g., Python). We consider two fundamental storage layouts: horizontal and vertical. Our proposed SQL queries lead to diverse query plans, which in turn yield highly different processing times. We identify storage layout (row vs. column) and relational join optimization as two key performance factors. After careful analysis and benchmarking, we recommend two SQL queries that work across DBMSs. We show that UDFs, an extensibility mechanism, are faster but have many disadvantages compared to plain SQL queries (not portable, system-dependent limitations, main memory limits, manual optimization required). An extensive experimental evaluation shows the pros and cons of our proposed SQL-based solution. Columnar storage provides an order-of-magnitude performance improvement over row storage. Moreover, SQL queries can match UDF performance on sparse matrices. We show that, by exploiting the summarization matrix in Python, computing two popular statistical models (linear regression and PCA) is much faster than with popular Python libraries (on a single machine) and also faster than with Apache Spark (a parallel, in-memory solution for big data clusters). We also show that our SQL-based solution exhibits linear speedup in parallel processing. In short, the DBMS can act as a backend linear algebra kernel.
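The abstract does not define the summarization matrix, so the following is only a minimal sketch under stated assumptions, not one of the queries recommended in the paper. It assumes the matrix holds the row count, the column sums, and the pairwise cross-products of the variables, computed with a single SQL aggregate query over a horizontal layout (one row per data point). SQLite stands in for the DBMS, and the table `X`, its columns, and the helper names `summarize`, `solve`, and `fit_linear_regression` are all hypothetical.

```python
import sqlite3

def summarize(conn, table, cols):
    """Return Gamma = sum over rows of z z^T, with z = (1, cols...).

    Assumed form of the summarization matrix: Gamma[0][0] is the row count,
    Gamma[0][j] the column sums, and the rest the cross-products.
    """
    z = ["1"] + list(cols)
    d = len(z)
    # Gamma is symmetric, so only the upper triangle is queried.
    pairs = [(i, j) for i in range(d) for j in range(i, d)]
    select = ", ".join(f"SUM({z[i]} * {z[j]})" for i, j in pairs)
    sums = conn.execute(f"SELECT {select} FROM {table}").fetchone()
    gamma = [[0.0] * d for _ in range(d)]
    for (i, j), s in zip(pairs, sums):
        gamma[i][j] = gamma[j][i] = float(s)
    return gamma

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_linear_regression(gamma):
    """Least-squares coefficients from Gamma alone; the last variable is y."""
    k = len(gamma) - 1
    a = [row[:k] for row in gamma[:k]]   # Z^T Z, with Z = [1, x1, ..., xp]
    b = [row[k] for row in gamma[:k]]    # Z^T y
    return solve(a, b)

# Toy data generated from y = 1 + 2*x1 + 3*x2 (illustrative only).
rows = [(1.0, 2.0, 9.0), (2.0, 1.0, 8.0), (3.0, 4.0, 19.0),
        (4.0, 3.0, 18.0), (5.0, 5.0, 26.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (x1 REAL, x2 REAL, y REAL)")
conn.executemany("INSERT INTO X VALUES (?, ?, ?)", rows)

gamma = summarize(conn, "X", ["x1", "x2", "y"])
beta = fit_linear_regression(gamma)  # intercept, then one slope per column
print([round(v, 6) for v in beta])
```

Under these assumptions, the DBMS makes a single pass over the data and returns a matrix whose size depends only on the number of columns, so the model fit in Python never touches the raw rows.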
