Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets

Journal Ijmer, B Susrutha, Vamsi Nath J

doi:10.6084/m9.figshare.1065518.v1

Abstract

Preparing a data set for analysis is generally the most time-consuming task in a data mining project, requiring many complex SQL queries, joining tables, and aggregating columns. Existing SQL aggregations are of limited use for preparing data sets because they return one column per aggregated group. In general, significant manual effort is required to build data sets where a horizontal layout is required. We propose simple, yet powerful, methods to generate SQL code that returns aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g., point-dimension, observation-variable, instance-feature), which is the standard layout required by most data mining algorithms. We propose three fundamental methods to evaluate horizontal aggregations: CASE, which exploits the programming CASE construct; SPJ, which is based on standard relational algebra operators (SPJ queries); and PIVOT, which uses the PIVOT operator offered by some DBMSs. Experiments with large tables compare the proposed query evaluation methods. Our CASE method has similar speed to the PIVOT operator and is much faster than the SPJ method. In general, the CASE and PIVOT methods exhibit linear scalability, whereas the SPJ method does not.

In a relational database, especially one with normalized tables, significant effort is required to prepare a summary data set that can be used as input for a data mining or statistical algorithm. Most algorithms require as input a data set with a horizontal layout, with several records and one variable or dimension per column. That is the case with models like clustering, classification, regression, and PCA. Each research discipline uses different terminology to describe the data set. In data mining the common terms are point-dimension. Statistics literature generally uses observation-variable. Machine learning research uses instance-feature. This article introduces a new class of aggregate functions that can be used to build data sets in a horizontal layout (denormalized with aggregations), automating SQL query writing and extending SQL capabilities. We show that evaluating horizontal aggregations is a challenging and interesting problem, and we introduce alternative methods and optimizations for their efficient evaluation.

II. MOTIVATION

As mentioned above, building a suitable data set for data mining purposes is a time-consuming task. This task generally requires writing long SQL statements, or customizing SQL code if it is automatically generated by some tool. There are two main ingredients in such SQL code: joins and aggregations; we focus on the second one. The most widely known aggregation is the sum of a column over groups of rows. Other aggregations return the average, maximum, minimum, or row count over groups of rows. Many aggregate functions and operators exist in SQL. Unfortunately, all of these aggregations have limitations when building data sets for data mining purposes. The main reason is that, in general, data sets that are stored in a relational database (or a data warehouse) come
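
To make the contrast between a standard (vertical) aggregation and the horizontal layout described in the abstract concrete, the following is a minimal sketch, not the authors' generated code. It assumes a hypothetical table `sales(store_id, dweek, amount)` and runs the queries in SQLite through Python's standard `sqlite3` module purely so the example is self-contained. It shows a CASE-style query and an SPJ-style query (one aggregated subquery per column, combined with left outer joins) that produce the same horizontal result. The PIVOT variant is omitted because SQLite does not provide a PIVOT operator.

```python
import sqlite3

# Hypothetical fact table (an assumption for illustration only):
# one row per sale, identified by store and day of week.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store_id INTEGER, dweek TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "Mon", 10.0),
        (1, "Mon", 5.0),
        (1, "Tue", 7.0),
        (2, "Mon", 3.0),
        (2, "Tue", 4.0),
        (2, "Tue", 6.0),
    ],
)

# Vertical aggregation: one row, and one number, per (store_id, dweek) group.
vertical = conn.execute(
    """
    SELECT store_id, dweek, SUM(amount)
    FROM sales
    GROUP BY store_id, dweek
    ORDER BY store_id, dweek
    """
).fetchall()

# CASE-style horizontal aggregation: one row per store and one column per
# dweek value, computed in a single pass over the table.
case_rows = conn.execute(
    """
    SELECT store_id,
           SUM(CASE WHEN dweek = 'Mon' THEN amount ELSE NULL END) AS amount_mon,
           SUM(CASE WHEN dweek = 'Tue' THEN amount ELSE NULL END) AS amount_tue
    FROM sales
    GROUP BY store_id
    ORDER BY store_id
    """
).fetchall()

# SPJ-style horizontal aggregation: one aggregated subquery per dweek value,
# assembled into columns with left outer joins on the grouping key.
spj_rows = conn.execute(
    """
    SELECT s.store_id, m.total AS amount_mon, t.total AS amount_tue
    FROM (SELECT DISTINCT store_id FROM sales) AS s
    LEFT JOIN (SELECT store_id, SUM(amount) AS total
               FROM sales WHERE dweek = 'Mon' GROUP BY store_id) AS m
           ON m.store_id = s.store_id
    LEFT JOIN (SELECT store_id, SUM(amount) AS total
               FROM sales WHERE dweek = 'Tue' GROUP BY store_id) AS t
           ON t.store_id = s.store_id
    ORDER BY s.store_id
    """
).fetchall()

print(vertical)   # four rows: one per (store, day) group
print(case_rows)  # two rows: one per store, one column per day
print(spj_rows)   # same horizontal layout as the CASE query
```

In this sketch the SPJ form scans `sales` once per output column, while the CASE form scans it once in total. That is consistent with the abstract's report that CASE is much faster than SPJ, although this toy table says nothing about performance at scale.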

