PARSIMONY: An Infrastructure for Parallel Multidimensional Analysis and Data Mining

Sanjay Goil,Alok Choudhary

doi:10.1006/jpdc.2000.1691

Sanjay Goil, Alok Choudhary

PDF Available

https://doi.org/10.1006/jpdc.2000.1691

Copy DOI

Export

Save

Cite

Abstract
Full-Text PDF
Similar Papers

Abstract

Listen

Multidimensional analysis and online analytical processing (OLAP) operations require summary information on multidimensional data sets. Most common are aggregate operations along one or more dimensions of numerical data values. Simultaneous calculation of multidimensional aggregates are provided by the Data Cube operator, used to calculate and store summary information on a number of dimensions. This is computed only partially if the number of dimensions is large. Query processing for these applications requires different views of data to gain insight and for effective decision support. Queries may either be answered from a materialized cube in the data cube or calculated on the fly. The multidimensionality of the underlying problem can be represented both in relational and in multidimensional databases, the latter being a better fit when query performance is the criteria for judgment. Relational databases are scalable in size for OLAP and multidimensional analysis and efforts are on to make their performance acceptable. On the other hand multidimensional databases have proven to provide good performance for such queries, although they are not very scalable. In this article we address (1) scalability in multidimensional systems for OLAP and multidimensional analysis and (2) integration of data mining with the OLAP framework. We describe our system PARSIMONY, parallel and scalable infrastructure for multidimensional online analytical processing, used for both OLAP and data mining. Sparsity of data sets is handled by using chunks to store data either as a dense block using multidimensional arrays or as sparse representation using a bit encoded sparse structure. Chunks provide a multidimensional index structure for efficient dimension oriented data accesses much the same as multidimensional arrays do. Operations within chunks and between chunks are a combination of relational and multidimensional operations depending on whether the chunk is sparse or dense. Further, we develop parallel algorithms for data mining on the multidimensional cube structure for attribute-oriented association rules and decision-tree-based classification. These take advantage of the data organization provided by the multidimensional data model. Performance results for high dimensional data sets on a distributed memory parallel machine (IBM SP-2) show good speedup and scalability.

Full Text